Discussion:
[PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
Nipun Gupta
2016-11-02 13:35:14 UTC
Permalink
The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR) provides an
option to enable the updation of TLB in case of bypass transactions due to
no stream match in the stream match table. This reduces the latencies of
the subsequent transactions with the same stream-id which bypasses the SMMU.
This provides a significant performance benefit for certain networking
workloads.

Signed-off-by: Nipun Gupta <***@nxp.com>
---
drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ce2a9d4..7010a5c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {

#define ARM_MMU500_ACTLR_CPRE (1 << 1)

+#define ACR_SMTNMB_TLBEN (1 << 8)
#define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)

#define CB_PAR_F (1 << 0)
@@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
for (i = 0; i < smmu->num_mapping_groups; ++i)
arm_smmu_write_sme(smmu, i);

+ /* Get the major rev required for configuring ACR */
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
+ major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
+
/*
* Before clearing ARM_MMU500_ACTLR_CPRE, need to
* clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
* bit is only present in MMU-500r2 onwards.
*/
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
- major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
- if ((smmu->model == ARM_MMU500) && (major >= 2)) {
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ if ((smmu->model == ARM_MMU500) && (major >= 2))
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
- writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
- }
+
+ /*
+ * Set the SMTNMB_TLBEN in ACR so that the transactions which
+ * bypass with SMMU due to no stream match found in the SMR table
+ * are updated in the TLB's.
+ */
+ reg |= ACR_SMTNMB_TLBEN;
+ writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);

/* Make sure all context banks are disabled and clear CB_FSR */
for (i = 0; i < smmu->num_context_banks; ++i) {
--
1.9.1
Robin Murphy
2016-11-02 11:21:23 UTC
Permalink
Hi Nipun,
Post by Nipun Gupta
The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR) provides an
option to enable the updation of TLB in case of bypass transactions due to
no stream match in the stream match table. This reduces the latencies of
the subsequent transactions with the same stream-id which bypasses the SMMU.
This provides a significant performance benefit for certain networking
workloads.
...at the cost of increased TLB contention against other workloads.
However, in the general case we'd expect the system to be fully
described, so if there aren't any unmatched Stream IDs there hopefully
shouldn't be an impact to leaving this switched on. I'd be interested to
see some actual performance numbers, though - you already know my
opinion about unsubstantiated quotes from the MMU-500 TRM.
Post by Nipun Gupta
---
drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ce2a9d4..7010a5c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
#define ARM_MMU500_ACTLR_CPRE (1 << 1)
+#define ACR_SMTNMB_TLBEN (1 << 8)
ACR is entirely implementation-defined, so there are no generic field
names. Please follow the naming convention handily demonstrated in the
subsequent context line.
Post by Nipun Gupta
#define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
Actually, can we also please keep these in descending order of bit
position like everything else?
Post by Nipun Gupta
#define CB_PAR_F (1 << 0)
@@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
for (i = 0; i < smmu->num_mapping_groups; ++i)
arm_smmu_write_sme(smmu, i);
+ /* Get the major rev required for configuring ACR */
That comment is nonsense.
Post by Nipun Gupta
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
+ major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
+
/*
* Before clearing ARM_MMU500_ACTLR_CPRE, need to
* clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
* bit is only present in MMU-500r2 onwards.
*/
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
- major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
- if ((smmu->model == ARM_MMU500) && (major >= 2)) {
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ if ((smmu->model == ARM_MMU500) && (major >= 2))
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
- writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
- }
+
+ /*
+ * Set the SMTNMB_TLBEN in ACR so that the transactions which
+ * bypass with SMMU due to no stream match found in the SMR table
+ * are updated in the TLB's.
Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB
entries for reduced latency". It's already clear from the code what
bit's being set where, we only need to remember *why*.
Post by Nipun Gupta
+ */
+ reg |= ACR_SMTNMB_TLBEN;
+ writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
Are you sure it's perfectly fine to set that implementation-defined bit
on any SMMU implementation other than the two-and-a-half ARM Ltd. ones
which happen to share the same meaning? I'm certainly not.

The correct flow would be something like this:

if (smmu->model == ARM_MMU500) {
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
if (major >= 2)
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
reg |= ACR_SMTNMB_TLBEN;
writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
}

The shape of the current code avoids an extra level of indentation, but
once you have to have the nested conditional anyway, it might as well
all be predicated appropriately.

Robin.
Post by Nipun Gupta
/* Make sure all context banks are disabled and clear CB_FSR */
for (i = 0; i < smmu->num_context_banks; ++i) {
Nipun Gupta
2016-11-02 19:26:02 UTC
Permalink
Hi Robin,
-----Original Message-----
Sent: Wednesday, November 02, 2016 16:51
Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
caching of bypass entries
Hi Nipun,
Post by Nipun Gupta
The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
provides an option to enable the updation of TLB in case of bypass
transactions due to no stream match in the stream match table. This
reduces the latencies of the subsequent transactions with the same stream-id
which bypasses the SMMU.
Post by Nipun Gupta
This provides a significant performance benefit for certain networking
workloads.
...at the cost of increased TLB contention against other workloads.
However, in the general case we'd expect the system to be fully described, so if
there aren't any unmatched Stream IDs there hopefully shouldn't be an impact
to leaving this switched on. I'd be interested to see some actual performance
numbers, though - you already know my opinion about unsubstantiated quotes
from the MMU-500 TRM.
With this change we have seen substantial performance improvement of ~9-10%
with DPDK l3fwd application (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP reflector application
(loopback mode - NXP in-house) we have seen 5% improvement in performance on
LS1088 platform.

W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from avg. ~140
platform cycles per memory read transactions which follow this bypass path (on LS2088
with DPDK l3fwd application).

(Apologies, I cannot share the DPDK/ODP exact performance numbers on the mailing list).
Post by Nipun Gupta
---
drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
ce2a9d4..7010a5c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
#define ARM_MMU500_ACTLR_CPRE (1 << 1)
+#define ACR_SMTNMB_TLBEN (1 << 8)
ACR is entirely implementation-defined, so there are no generic field names.
Please follow the naming convention handily demonstrated in the subsequent
context line.
Post by Nipun Gupta
#define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
Actually, can we also please keep these in descending order of bit position like
everything else?
Post by Nipun Gupta
#define CB_PAR_F (1 << 0)
@@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
arm_smmu_device *smmu)
Post by Nipun Gupta
for (i = 0; i < smmu->num_mapping_groups; ++i)
arm_smmu_write_sme(smmu, i);
+ /* Get the major rev required for configuring ACR */
That comment is nonsense.
Post by Nipun Gupta
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
+ major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
+
/*
* Before clearing ARM_MMU500_ACTLR_CPRE, need to
* clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
* bit is only present in MMU-500r2 onwards.
*/
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
- major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
- if ((smmu->model == ARM_MMU500) && (major >= 2)) {
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ if ((smmu->model == ARM_MMU500) && (major >= 2))
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
- writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
- }
+
+ /*
+ * Set the SMTNMB_TLBEN in ACR so that the transactions which
+ * bypass with SMMU due to no stream match found in the SMR table
+ * are updated in the TLB's.
Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries for
reduced latency". It's already clear from the code what bit's being set where, we
only need to remember *why*.
Post by Nipun Gupta
+ */
+ reg |= ACR_SMTNMB_TLBEN;
+ writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
Are you sure it's perfectly fine to set that implementation-defined bit on any
SMMU implementation other than the two-and-a-half ARM Ltd. ones which
happen to share the same meaning? I'm certainly not.
if (smmu->model == ARM_MMU500) {
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
if (major >= 2)
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
reg |= ACR_SMTNMB_TLBEN;
writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
}
The shape of the current code avoids an extra level of indentation, but once you
have to have the nested conditional anyway, it might as well all be predicated
appropriately.
Thank you for providing the useful comments. I would incorporate them all in next version :).

Regards,
Nipun
Robin.
Post by Nipun Gupta
/* Make sure all context banks are disabled and clear CB_FSR */
for (i = 0; i < smmu->num_context_banks; ++i) {
Robin Murphy
2016-11-03 11:08:34 UTC
Permalink
Post by Nipun Gupta
Hi Robin,
-----Original Message-----
Sent: Wednesday, November 02, 2016 16:51
Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
caching of bypass entries
Hi Nipun,
Post by Nipun Gupta
The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
provides an option to enable the updation of TLB in case of bypass
transactions due to no stream match in the stream match table. This
reduces the latencies of the subsequent transactions with the same stream-id
which bypasses the SMMU.
Post by Nipun Gupta
This provides a significant performance benefit for certain networking
workloads.
...at the cost of increased TLB contention against other workloads.
However, in the general case we'd expect the system to be fully described, so if
there aren't any unmatched Stream IDs there hopefully shouldn't be an impact
to leaving this switched on. I'd be interested to see some actual performance
numbers, though - you already know my opinion about unsubstantiated quotes
from the MMU-500 TRM.
With this change we have seen substantial performance improvement of ~9-10%
with DPDK l3fwd application (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP reflector application
(loopback mode - NXP in-house) we have seen 5% improvement in performance on
LS1088 platform.
W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from avg. ~140
platform cycles per memory read transactions which follow this bypass path (on LS2088
with DPDK l3fwd application).
(Apologies, I cannot share the DPDK/ODP exact performance numbers on the mailing list).
That's understandable, and I'm not sure I'd know how to interpret them
anyway ;) I reckon the percentages make a sufficiently compelling
qualification of the improvement, so it would be good to have that
summarised in the commit log.
Post by Nipun Gupta
Post by Nipun Gupta
---
drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
ce2a9d4..7010a5c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
#define ARM_MMU500_ACTLR_CPRE (1 << 1)
+#define ACR_SMTNMB_TLBEN (1 << 8)
ACR is entirely implementation-defined, so there are no generic field names.
Please follow the naming convention handily demonstrated in the subsequent
context line.
Post by Nipun Gupta
#define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
Actually, can we also please keep these in descending order of bit position like
everything else?
Post by Nipun Gupta
#define CB_PAR_F (1 << 0)
@@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
arm_smmu_device *smmu)
Post by Nipun Gupta
for (i = 0; i < smmu->num_mapping_groups; ++i)
arm_smmu_write_sme(smmu, i);
+ /* Get the major rev required for configuring ACR */
That comment is nonsense.
Post by Nipun Gupta
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
+ major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
+
/*
* Before clearing ARM_MMU500_ACTLR_CPRE, need to
* clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
* bit is only present in MMU-500r2 onwards.
*/
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
- major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
- if ((smmu->model == ARM_MMU500) && (major >= 2)) {
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ if ((smmu->model == ARM_MMU500) && (major >= 2))
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
- writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
- }
+
+ /*
+ * Set the SMTNMB_TLBEN in ACR so that the transactions which
+ * bypass with SMMU due to no stream match found in the SMR table
+ * are updated in the TLB's.
Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries for
reduced latency". It's already clear from the code what bit's being set where, we
only need to remember *why*.
Post by Nipun Gupta
+ */
+ reg |= ACR_SMTNMB_TLBEN;
+ writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
Are you sure it's perfectly fine to set that implementation-defined bit on any
SMMU implementation other than the two-and-a-half ARM Ltd. ones which
happen to share the same meaning? I'm certainly not.
if (smmu->model == ARM_MMU500) {
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
if (major >= 2)
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
reg |= ACR_SMTNMB_TLBEN;
writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
}
The shape of the current code avoids an extra level of indentation, but once you
have to have the nested conditional anyway, it might as well all be predicated
appropriately.
Thank you for providing the useful comments. I would incorporate them all in next version :).
Cool. Just for clarity (I realise I was thinking it, but never said it
outright), whilst MMU-40x do share the same feature with the same ACR
bit definition as MMU-500, I'd be inclined not to bother with them.
Since the monolithic microarchitecture means there's normally a separate
MMU-40x per device, if you don't want translation for that device you
can simply not probe the thing to turn it on in the first place.

Robin.
Post by Nipun Gupta
Regards,
Nipun
Robin.
Post by Nipun Gupta
/* Make sure all context banks are disabled and clear CB_FSR */
for (i = 0; i < smmu->num_context_banks; ++i) {
Nipun Gupta
2016-11-03 18:24:11 UTC
Permalink
Hi Robin,
-----Original Message-----
Sent: Thursday, November 03, 2016 16:39
Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
caching of bypass entries
Post by Nipun Gupta
Hi Robin,
-----Original Message-----
Sent: Wednesday, November 02, 2016 16:51
Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to
enable
Post by Nipun Gupta
caching of bypass entries
Hi Nipun,
Post by Nipun Gupta
The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
provides an option to enable the updation of TLB in case of bypass
transactions due to no stream match in the stream match table. This
reduces the latencies of the subsequent transactions with the same stream-
id
Post by Nipun Gupta
which bypasses the SMMU.
Post by Nipun Gupta
This provides a significant performance benefit for certain networking
workloads.
...at the cost of increased TLB contention against other workloads.
However, in the general case we'd expect the system to be fully described, so
if
Post by Nipun Gupta
there aren't any unmatched Stream IDs there hopefully shouldn't be an
impact
Post by Nipun Gupta
to leaving this switched on. I'd be interested to see some actual performance
numbers, though - you already know my opinion about unsubstantiated
quotes
Post by Nipun Gupta
from the MMU-500 TRM.
With this change we have seen substantial performance improvement of ~9-
10%
Post by Nipun Gupta
with DPDK l3fwd application
(http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
Post by Nipun Gupta
on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP
reflector application
Post by Nipun Gupta
(loopback mode - NXP in-house) we have seen 5% improvement in
performance on
Post by Nipun Gupta
LS1088 platform.
W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from
avg. ~140
Post by Nipun Gupta
platform cycles per memory read transactions which follow this bypass path
(on LS2088
Post by Nipun Gupta
with DPDK l3fwd application).
(Apologies, I cannot share the DPDK/ODP exact performance numbers on the
mailing list).
That's understandable, and I'm not sure I'd know how to interpret them
anyway ;) I reckon the percentages make a sufficiently compelling
qualification of the improvement, so it would be good to have that
summarised in the commit log.
Sure, I'll add a part of it in the commit log.
Post by Nipun Gupta
Post by Nipun Gupta
---
drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
ce2a9d4..7010a5c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
#define ARM_MMU500_ACTLR_CPRE (1 << 1)
+#define ACR_SMTNMB_TLBEN (1 << 8)
ACR is entirely implementation-defined, so there are no generic field names.
Please follow the naming convention handily demonstrated in the subsequent
context line.
Post by Nipun Gupta
#define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
Actually, can we also please keep these in descending order of bit position
like
Post by Nipun Gupta
everything else?
Post by Nipun Gupta
#define CB_PAR_F (1 << 0)
@@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
arm_smmu_device *smmu)
Post by Nipun Gupta
for (i = 0; i < smmu->num_mapping_groups; ++i)
arm_smmu_write_sme(smmu, i);
+ /* Get the major rev required for configuring ACR */
That comment is nonsense.
Post by Nipun Gupta
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
+ major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
+
/*
* Before clearing ARM_MMU500_ACTLR_CPRE, need to
* clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
* bit is only present in MMU-500r2 onwards.
*/
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
- major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
- if ((smmu->model == ARM_MMU500) && (major >= 2)) {
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
+ if ((smmu->model == ARM_MMU500) && (major >= 2))
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
- writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
- }
+
+ /*
+ * Set the SMTNMB_TLBEN in ACR so that the transactions which
+ * bypass with SMMU due to no stream match found in the SMR table
+ * are updated in the TLB's.
Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries
for
Post by Nipun Gupta
reduced latency". It's already clear from the code what bit's being set where,
we
Post by Nipun Gupta
only need to remember *why*.
Post by Nipun Gupta
+ */
+ reg |= ACR_SMTNMB_TLBEN;
+ writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
Are you sure it's perfectly fine to set that implementation-defined bit on any
SMMU implementation other than the two-and-a-half ARM Ltd. ones which
happen to share the same meaning? I'm certainly not.
if (smmu->model == ARM_MMU500) {
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
if (major >= 2)
reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
reg |= ACR_SMTNMB_TLBEN;
writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
}
The shape of the current code avoids an extra level of indentation, but once
you
Post by Nipun Gupta
have to have the nested conditional anyway, it might as well all be predicated
appropriately.
Thank you for providing the useful comments. I would incorporate them all in
next version :).
Cool. Just for clarity (I realise I was thinking it, but never said it
outright), whilst MMU-40x do share the same feature with the same ACR
bit definition as MMU-500, I'd be inclined not to bother with them.
Since the monolithic microarchitecture means there's normally a separate
MMU-40x per device, if you don't want translation for that device you
can simply not probe the thing to turn it on in the first place.
This seems a pretty decent reason to have this bit set only for MMU-500.
I'll send a patch v2 soon.

Thanks,
Nipun
Robin.
Post by Nipun Gupta
Regards,
Nipun
Robin.
Post by Nipun Gupta
/* Make sure all context banks are disabled and clear CB_FSR */
for (i = 0; i < smmu->num_context_banks; ++i) {
Nipun Gupta
2016-11-04 00:27:23 UTC
Permalink
The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR) provides an
option to enable the updation of TLB in case of bypass transactions due to
no stream match in the stream match table. This reduces the latencies of
the subsequent transactions with the same stream-id which bypasses the SMMU.
This provides a significant performance benefit for certain networking
workloads.

With this change substantial performance improvement of ~9% is observed with
DPDK l3fwd application (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
on NXP's LS2088a platform.

Signed-off-by: Nipun Gupta <***@nxp.com>
---
Changes for v2:
- Incorporated Robin's comments on v1 related to
Setting SMTNMB_TLBEN in ACR only for MMU-500 as ACR is implementation dependent
Code comments and Naming convention

drivers/iommu/arm-smmu.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index ce2a9d4..05901be 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -247,6 +247,7 @@ enum arm_smmu_s2cr_privcfg {
#define ARM_MMU500_ACTLR_CPRE (1 << 1)

#define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
+#define ARM_MMU500_ACR_SMTNMB_TLBEN (1 << 8)

#define CB_PAR_F (1 << 0)

@@ -1569,16 +1570,22 @@ static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
for (i = 0; i < smmu->num_mapping_groups; ++i)
arm_smmu_write_sme(smmu, i);

- /*
- * Before clearing ARM_MMU500_ACTLR_CPRE, need to
- * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
- * bit is only present in MMU-500r2 onwards.
- */
- reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
- major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
- if ((smmu->model == ARM_MMU500) && (major >= 2)) {
+ if (smmu->model == ARM_MMU500) {
+ /*
+ * Before clearing ARM_MMU500_ACTLR_CPRE, need to
+ * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
+ * bit is only present in MMU-500r2 onwards.
+ */
+ reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
+ major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
- reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
+ if (major >= 2)
+ reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
+ /*
+ * Allow unmatched Stream IDs to allocate bypass
+ * TLB entries for reduced latency.
+ */
+ reg |= ARM_MMU500_ACR_SMTNMB_TLBEN;
writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
}
--
1.9.1
Stuart Yoder
2016-11-03 20:57:26 UTC
Permalink
-----Original Message-----
Sent: Thursday, November 03, 2016 7:27 PM
foundation.org
Subject: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
When you respin a patch, put the version number in the patch subject:

[PATCH v2] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries

Stuart
Nipun Gupta
2016-11-04 03:52:16 UTC
Permalink
-----Original Message-----
From: Stuart Yoder
Sent: Friday, November 04, 2016 2:27
Subject: RE: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
caching of bypass entries
-----Original Message-----
Sent: Thursday, November 03, 2016 7:27 PM
foundation.org
Subject: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
caching of bypass entries
[PATCH v2] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
I added that first, but in the last moment used git format-patch after minor updates
and somehow missed it in the final patch :/. Re-spinning v3 with correct version number.

Thanks,
Nipun
Stuart
Loading...