Discussion:
Aleph client crash with OutOfMemoryError
Reynald Borer
2018-06-25 08:01:37 UTC
Hi folks,

I'm using Aleph as an HTTP client to crawl many different URLs for the
https://paper.li/ service.

On my testing environment, which processes around 350 different URLs
per minute, the Aleph pool seems to exhaust the heap (set to 9 GB) after a
few hours of processing. The culprit seems to be the Pool itself, which is
retaining a lot of Long and Double instances.

I've posted my initial heap dump analysis under
https://github.com/ztellman/aleph/issues/394. If you've already
experienced this kind of issue, your help is greatly appreciated.

I'll continue digging into the Aleph code to try to get a better
understanding of how this may happen.

Best regards,
Reynald
--
You received this message because you are subscribed to the Google Groups "Aleph" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aleph-lib+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Zach Tellman
2018-06-25 12:40:15 UTC
Hi Reynald,

How many different domains are you crawling, and how many times do you
query each one? Given that each reservoir in Dirigiste is pretty small,
I'd guess a lot. The HashMap detail is very helpful, because there's only
one non-concurrent hash map anywhere in the code:
https://github.com/ztellman/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L209.
This is meant to be regularly emptied, but it's possible there's something
about your request pattern that's not triggering that. If you can provide
a bit more detail, I can probably get some sort of fix later today.

Zach
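The pattern Zach describes can be sketched in plain Java. This is a hypothetical illustration, not Dirigiste's actual code: per-key stats accumulate in a non-concurrent HashMap, and a control loop is expected to drain it at each interval. If the drain step never runs (or runs too rarely for the request pattern), the map grows with every new key, here, every new domain.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the accumulate-then-drain pattern: keys pile up
// in a plain HashMap between control-loop iterations, and only drain()
// lets the old entries become garbage-collectable.
public class StatsDrainSketch {
    private final Map<String, long[]> stats = new HashMap<>();

    // Called on every request: records a sample in the key's reservoir.
    public synchronized void sample(String key, long value) {
        stats.computeIfAbsent(key, k -> new long[16])[0] = value;
    }

    // Meant to run periodically from the control loop: takes a snapshot
    // and leaves an empty map behind, so stale keys can be collected.
    public synchronized Map<String, long[]> drain() {
        Map<String, long[]> snapshot = new HashMap<>(stats);
        stats.clear();
        return snapshot;
    }

    public synchronized int size() {
        return stats.size();
    }

    public static void main(String[] args) {
        StatsDrainSketch s = new StatsDrainSketch();
        s.sample("a.example", 1);
        s.sample("b.example", 2);
        System.out.println(s.size()); // 2 before draining
        s.drain();
        System.out.println(s.size()); // 0 after draining
    }
}
```

With millions of unique domains per day and no (or delayed) draining, `size()` grows unboundedly, which would match the heap-dump picture of a Pool retaining huge numbers of boxed primitives.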
Reynald Borer
2018-06-25 12:52:03 UTC
Hi Zach,

You're right in assuming that we crawl many different domains! Our service
listens to content published on social media (mostly Twitter, in fact),
and we crawl all the links we find to discover content. This means that
every day we crawl around 25 million URLs. URLs tend to follow a
logarithmic distribution (that is, a few domains have a lot of distinct
URLs, while a lot of domains have only a few URLs).

Sorry not to provide more accurate numbers. I can try to compute them, but
it'll take me a bit of time.

On my side, I haven't (yet) been able to find a triggering pattern for this
OOM. Right now I have enabled the `idle-timeout` parameter on the HTTP
pool (it wasn't set), though I'm not sure it will have any impact on
memory pressure.

Cheers,
Reynald
Zach Tellman
2018-06-25 13:05:31 UTC
Every HTTP request in Aleph uses a connection pool, and every pool has a
"Pool" object. By default, requests will use
`aleph.http/default-connection-pool`, but you can specify a different one
using the `:pool` option in your request. The quick fix here is to specify
your own pool and periodically swap it out (GC should take care of the
rest). However, the HashMap that seems to be retaining all this data is
supposed to be a WeakHashSet, so theoretically it should be cleaning itself
up, and right now I'm not sure why it isn't. I'd like to figure this out,
but if this is hindering production I suggest using the workaround while I
try to reproduce your issue locally.

Zach
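The self-cleanup behaviour Zach expects from a weak set can be demonstrated with the JDK's `WeakHashMap`. This is a minimal standalone sketch, not Dirigiste's code: an entry whose key is no longer strongly referenced anywhere else becomes eligible for removal after a GC, so the set should not retain dead keys indefinitely.

```java
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

// Minimal sketch of weak-set semantics: once the only strong reference
// to a key is dropped, the entry may be expunged after a collection.
// Collection timing is up to the JVM, so real code must never rely on
// *when* entries disappear, only that they eventually can.
public class WeakSetSketch {
    public static void main(String[] args) {
        Set<Object> weakSet = Collections.newSetFromMap(new WeakHashMap<>());

        Object key = new Object();
        weakSet.add(key);
        System.out.println(weakSet.contains(key)); // true: key strongly held

        key = null;  // drop the only strong reference
        System.gc(); // *request* a collection; not guaranteed to run
        // After a collection actually runs, the stale entry is expunged
        // on the next access, so the set tends back toward empty.
        System.out.println(weakSet.size() <= 1);   // true either way
    }
}
```

If the map in the heap dump were genuinely weak-keyed, old per-domain entries should behave like this; the fact that they don't is what makes the retention suspicious.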
Reynald Borer
2018-06-25 13:23:27 UTC
It's not running in production yet, so it's safe to play a bit with it :-)
Zach Tellman
2018-06-25 13:56:12 UTC
Oh, I’m not sure how I missed that. That does seem to be a more likely
culprit. I’ll take a closer look this evening.
Reynald Borer
2018-06-28 06:51:45 UTC
Hi,

Zach, did you have time to investigate this issue?

On my side, I started looking at Dirigiste more closely, and I'm proposing
the following changes:

a. https://github.com/ztellman/dirigiste/pull/18

The Pool instance retains a _stats field holding the gathered stats. Since
the sole usage of _stats is within the startControlLoop() method, I've
moved this field directly into that method. This should help reduce
memory usage.

b. https://github.com/ztellman/dirigiste/pull/17

I ran a SonarQube static code analysis on Dirigiste, hoping it would show
code smells that could explain the high memory usage I encounter. It found
few blocker-level problems, but it highlighted some potential issues we
should have a look at. #17 proposes some changes that should also help
slightly reduce memory usage.

Since SonarQube offers a cloud version that is free for open-source
projects (https://about.sonarcloud.io/sq/), it may be interesting to set
Dirigiste up there (it doesn't support Clojure code, though).

Feedback is more than welcome on both pull requests. Since I don't know
Dirigiste's internals well, I may be wrong.

Cheers,
Reynald
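The idea behind the first change can be sketched abstractly. This is a hypothetical before/after illustration (not the actual PR #18 diff): a snapshot held in an instance field stays reachable for the lifetime of the Pool object, whereas a local variable inside the control-loop method becomes unreachable, and collectable, as soon as the method scope ends or the loop replaces it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical field-vs-local sketch: both methods "use" a stats
// snapshot, but only the field variant pins it beyond the method call.
public class FieldVsLocalSketch {
    // Before: the latest snapshot lives as long as this object does.
    private Map<String, double[]> _stats = new HashMap<>();

    void controlLoopWithField(Map<String, double[]> incoming) {
        _stats = incoming; // the field keeps the newest snapshot reachable
    }

    // After: the snapshot is scoped to the method; once it returns,
    // nothing outside can keep the snapshot alive.
    static int controlLoopWithLocal(Map<String, double[]> incoming) {
        Map<String, double[]> stats = incoming; // local, not a field
        return stats.size();                    // use it, then let it go
    }

    public static void main(String[] args) {
        Map<String, double[]> snapshot = new HashMap<>();
        snapshot.put("example.com", new double[] {1.0});
        System.out.println(controlLoopWithLocal(snapshot)); // 1
    }
}
```

The behaviour is identical; the difference is purely in how long the old snapshots remain GC roots.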
Reynald Borer
2018-06-28 08:56:08 UTC
Me again :-)

I was finally able to run Eclipse Memory Analyzer on my MacBook Pro, so I
could continue the analysis. I've posted my findings under
https://github.com/ztellman/aleph/issues/394#issuecomment-400962912.

Obviously, https://github.com/ztellman/dirigiste/pull/18 is a step in the
right direction, though we may also want to consider reducing the overall
memory footprint of those stats.

Cheers,
Reynald
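Why the stats footprint matters can be shown with back-of-the-envelope arithmetic. All numbers below are illustrative assumptions, not measured values from Dirigiste: each tracked key holds a few fixed-size arrays of primitive doubles, so the footprint scales linearly with the number of distinct keys (domains), before even counting object headers and map overhead.

```java
// Rough footprint estimate for per-key stats reservoirs. The reservoir
// size, metric count, and key count are made-up example figures.
public class FootprintSketch {
    public static long approxBytes(long keys, int reservoirSize, int metrics) {
        long perArray = 8L * reservoirSize; // a double is 8 bytes
        return keys * metrics * perArray;   // ignores headers, map entries
    }

    public static void main(String[] args) {
        // e.g. 100k distinct domains, 1024-slot reservoirs, 5 metrics
        System.out.println(approxBytes(100_000, 1024, 5)); // 4096000000
    }
}
```

Even with modest per-key reservoirs, a long tail of distinct domains multiplies this into gigabytes, which is consistent with a 9 GB heap filling up over a few hours of crawling.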
Zach Tellman
2018-06-28 12:49:35 UTC
My apologies for the delay; I'm in the middle of a contract this week and
it's taken up all of my mental energy. I'll be able to look at everything
this weekend.
Post by Reynald Borer
Me again :-)
I was finally able to run Eclipse Memory Analyzer on my Macbook pro so I
could continue the analysis. I've posted my findings under
https://github.com/ztellman/aleph/issues/394#issuecomment-400962912 .
Obviously, https://github.com/ztellman/dirigiste/pull/18 is a step in the
right direction. Although we may want to consider reducing the overall
memory footprint of those stats too.
Cheers,
Reynald
Post by Reynald Borer
Hi,
Zach, did you had time to investigate this issue?
On my side, I started looking at dirigiste more closely, and I'm
a. https://github.com/ztellman/dirigiste/pull/18
The Pool instance retains a _stats variable with the gathered stats.
Since the sole usage of those _stats are within startControlLoop() method,
I've moved this field directly within this method. This should help
reducing memory usage.
b. https://github.com/ztellman/dirigiste/pull/17
I ran a SonarQube static code analysis on Dirigiste hoping it would show
code smells that could indicate why I encounter high memory usage. It
didn't find blocker problems (few), but highlighted some potential issues
we should have a look at. #17 proposes some changes that should help
slightly reducing memory usage too.
Since SonarQube offers a cloud version that is free for open-source
projects (https://about.sonarcloud.io/sq/) it may be interesting to
configure Dirigiste there (it doesn't support Clojure code though).
Feedbacks are more than welcome on both pull requests. Since I don't know
Dirigiste internals much, I may be wrong.
Cheers,
Reynald
Post by Zach Tellman
Oh, I’m not sure how I missed that. That does seem to be a more likely
culprit. I’ll take a closer look this evening.
Post by Reynald Borer
It's not running in production yet, so it's safe to play a bit with it :-)
Post by Zach Tellman
Every HTTP request in Aleph uses a connection pool, and every pool has
a "Pool" object. By default, requests will use
`aleph.http/default-connection-pool`, but you can specify a different one
using the :pool option in your request. The quick fix here is to specify
your own pool, and periodically swap them out (GC should take care of the
rest). However, the HashMap that seems to be retaining all this data is
supposed to be a WeakHashSet, so theoretically it should be cleaning itself
up, and right now I'm not sure why it isn't. I'd like to figure this out,
but if this is hindering production I suggest using the workaround while I
try to reproduce your issue locally.
Zach
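On the "should clean itself up" point: a weakly-keyed collection only drops an entry once nothing else strongly references its key, and only after a GC cycle has run. A minimal stdlib illustration of that semantics (hypothetical names; not Aleph's or Dirigiste's actual code):

```java
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative only: WeakHashMap entries disappear only after their keys
// become unreachable AND a GC cycle has run. If anything (a request map,
// a cache, a stats structure) still holds the key, the entry is pinned.
public class WeakCleanup {
    static final Map<Object, String> POOLS = new WeakHashMap<>();

    public static void main(String[] args) {
        Object liveKey = new Object();
        POOLS.put(liveKey, "pinned");           // key strongly held below
        POOLS.put(new Object(), "collectible"); // key immediately unreachable

        System.gc(); // a hint only; collection of weak keys is not guaranteed

        // The strongly referenced key is always retained.
        System.out.println(POOLS.containsKey(liveKey)); // true
    }
}
```

If the leaked map's keys are still strongly referenced somewhere, weak semantics alone won't empty it, which would be consistent with the retention seen in the heap dump.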
Post by Reynald Borer
Hi Zach,
You're right in assuming that we crawl many different domains! Our
service listens to content published on social media (mostly
Twitter, in fact), and we crawl all the links we find to discover content.
This means that every day we crawl around 25 million URLs. URLs tend to
follow a logarithmic distribution (that is, a few domains have a lot of
distinct URLs, while a lot of domains have a small number of URLs).
Sorry not to provide more accurate numbers; I can try to compute
them, but it will take me a bit of time.
On my side, I haven't (yet) been able to find a triggering pattern
for this OOM. Right now, I have enabled the `idle-timeout` parameter on the
http pool (it wasn't set before), though I am not sure whether it has an
impact on memory pressure.
Cheers,
Reynald
Post by Zach Tellman
Hi Reynald,
How many different domains are you crawling, and how many times do
you query each one? Given that each reservoir in Dirigiste is pretty
small, I'd guess a lot. The HashMap detail is very helpful, because
https://github.com/ztellman/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L209.
This is meant to be regularly emptied, but it's possible there's something
about your request pattern that's not triggering that. If you can provide
a bit more detail, I can probably get some sort of fix later today.
Zach
On Mon, Jun 25, 2018 at 3:01 AM Reynald Borer <
Post by Reynald Borer
Hi folks,
I'm using Aleph as an http client to crawl multiple different URLs
for https://paper.li/ service.
On my testing environment, which is processing around 350 different
URLs per minute, Aleph pool seems to exhaust the heap (set to 9G) after a
few hours of processing. The culprit seems to be the Pool itself which is
retaining a lot of long & double instances.
I've posted my initial heap dump analysis under
https://github.com/ztellman/aleph/issues/394 . If you've already
experienced this kind of issue, your help is greatly appreciated.
I'll continue digging into aleph code to try to get a better
understanding of how it may happen.
Best regards,
Reynald
Ivan
2018-06-25 13:22:45 UTC
Permalink
Hi Zach,
There is one more HashMap in the Pool:
https://github.com/ztellman/dirigiste/blob/795dc03a579465ced27e76ba5ab0c3254671ed10/src/io/aleph/dirigiste/Pool.java#L204
It's initialized via updateStats(), which returns a HashMap:
https://github.com/ztellman/dirigiste/blob/795dc03a579465ced27e76ba5ab0c3254671ed10/src/io/aleph/dirigiste/Pool.java#L245
Ivan
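To make the growth pattern concrete, here is a minimal, self-contained illustration (not Dirigiste's actual code; names are hypothetical) of how a per-key stats map that is never pruned accumulates one entry per distinct key — the shape of growth a crawler hitting millions of mostly-unique hosts would produce:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a per-host stats map with no pruning. With a
// crawler's key distribution (millions of mostly-unique hosts), the map
// grows without bound -- matching the heap-dump symptom in this thread.
public class StatsRetention {
    static final Map<String, long[]> STATS = new HashMap<>();

    // Record a sample for a host, creating its entry on first sight.
    static void record(String host, long value) {
        STATS.computeIfAbsent(host, k -> new long[1])[0] += value;
    }

    // One entry per distinct host is retained forever unless something
    // (such as the intended periodic cleanup) removes dead keys.
    static int distinctHostsTracked() {
        return STATS.size();
    }
}
```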
Post by Zach Tellman
Hi Reynald,
How many different domains are you crawling, and how many times do you
query each one? Given that each reservoir in Dirigiste is pretty small,
I'd guess a lot. The HashMap detail is very helpful, because there's only
one non-concurrent hash map anywhere in the code:
https://github.com/ztellman/dirigiste/blob/master/src/io/aleph/dirigiste/Pool.java#L209.
This is meant to be regularly emptied, but it's possible there's something
about your request pattern that's not triggering that. If you can provide
a bit more detail, I can probably get some sort of fix later today.
Zach
Post by Reynald Borer
Hi folks,
I'm using Aleph as an http client to crawl multiple different URLs for
https://paper.li/ service.
On my testing environment, which is processing around 350 different URLs
per minute, Aleph pool seems to exhaust the heap (set to 9G) after a few
hours of processing. The culprit seems to be the Pool itself which is
retaining a lot of long & double instances.
I've posted my initial heap dump analysis under
https://github.com/ztellman/aleph/issues/394 . If you've already experienced this kind of
issue, your help is greatly appreciated.
I'll continue digging into aleph code to try to get a better
understanding of how it may happen.
Best regards,
Reynald