Discussion:
Performance decrease
Nathan B
2015-04-23 14:46:20 UTC
I recently switched from Aleph 0.3.2 to the new Aleph 0.4.0 and noticed
roughly a doubling of CPU usage for the same load after the switch. We use
Aleph for a WebSocket server. The basic change we made was to use let-flow
to grab the websocket, and to switch from lamina/receive-all to
stream/consume. The only other change is that when sending results back to
the client we now use stream/put! and throw away the returned deferred,
since we don't need to track whether the send succeeded.
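
Roughly, the handler now looks like this (a simplified sketch;
handle-message stands in for our real per-message processing, and the port
is illustrative):

  (require '[aleph.http :as http]
           '[manifold.deferred :as d]
           '[manifold.stream :as s])

  (defn handle-message
    "Placeholder for our real per-message processing."
    [msg]
    (str "echo: " msg))

  (defn ws-handler [req]
    ;; let-flow waits for the websocket handshake to complete
    (d/let-flow [conn (http/websocket-connection req)]
      (s/consume
       (fn [msg]
         ;; fire-and-forget: the deferred returned by put! is discarded
         (s/put! conn (handle-message msg)))
       conn)))

  ;; (http/start-server ws-handler {:port 8080})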

Any thoughts on what could be going on here to hurt performance so much?
Is there a way to do the send without creating a deferred when you prefer
fire-and-forget semantics? I'm wondering whether those deferreds are
generating substantially more garbage collection in our scenario, for
functionality we don't use.
Zach Tellman
2015-04-23 16:12:22 UTC
Hi Nathan,

Can you characterize a few things about your system:

* how many connections are opened per second
* how long are connections left open
* how many messages per second do you get per connection

This may be overhead in the protocol negotiation, or in the communication
once the connection is opened, or both. Lamina was certainly a faster
stream implementation w.r.t. throughput than Manifold currently is, but
Manifold is still 2-3x faster than core.async, and I'd expect the
difference between all of them to be negligible compared to the cost of
sending data over the network, which the new version of Netty should make
faster.

If you really want to make this easy for me, using YourKit to capture a
sampled profile of your system in production, run for 20-30 minutes, would
be very helpful. However, answers to the above questions would at least
allow me to write a simple benchmark.

Zach
Nathan B
2015-04-24 13:04:33 UTC
Zach:

Thanks for the quick reply. I worked on diagnosing this further yesterday
so that I could pinpoint where the issue is. We don't have a YourKit
license to get you that, but I was finally able to get VisualVM hooked up,
and the hotspot method after running it for a bit was
io.netty.channel.epoll.Native.epollWait0, at 98% of the CPU time.

This made me think that maybe it was related to a bug in the native epoll
support in Netty, but after removing the native .so file to force NIO, the
load on the box was about the same. I haven't been able to hook up VisualVM
again since making that change, so I'm going to do that next to see whether
the time is in the same event-loop area of the code.
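
In case it's useful, a quick REPL check against Netty's Epoll class
(assuming Netty 4's native-transport classes are on the classpath) is how
I've been confirming which transport is actually in play:

  ;; true when the native epoll transport can be loaded,
  ;; false once the .so is removed and we fall back to NIO
  (io.netty.channel.epoll.Epoll/isAvailable)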

Our usage pattern is a high rate of connects/disconnects relative to the
actual amount of data transmitted per connection. But it seems like all of
the CPU is in the epoll loop, which is odd, as you would think that
processing the data received/transmitted in our Clojure code would outweigh
the low-level C epoll calls.
Zach Tellman
2015-04-24 16:41:37 UTC
The profiler doesn't differentiate between a thread that's busy and a
thread that's just waiting for something to happen. In the case of
'io.netty.channel.epoll.Native.epollWait0', it's the latter. In other
words, your server is only processing input off the network ~2% of the
time. Can you elaborate on the actual numbers for the rate of connections,
messages, etc.?
Tim Clemons
2015-06-16 18:59:05 UTC
We've had issues with a somewhat infamous Netty epoll bug. Try adding the
following to your JAVA_OPTS as a workaround:

-Dorg.jboss.netty.epollBugWorkaround=true
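
For example, with Leiningen that could go in the project's :jvm-opts
(illustrative; adjust for whatever launcher you use):

  ;; project.clj
  :jvm-opts ["-Dorg.jboss.netty.epollBugWorkaround=true"]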
Zach Tellman
2015-06-16 19:13:40 UTC
My understanding is that that workaround was for pre-v4 Netty versions (the
org.jboss.netty property prefix dates from Netty 3). Can you provide any
references for it still being a concern in the latest versions?