GC_BRIDGE takes several seconds to run (num_hash_entries 627215) and causes freezes

In our Android application we experience "freezes" on some of GC_BRIDGE calls.
When that happens, we see the following in logs:

[monodroid-gc] GC cleanup summary: 2994 objects tested - resurrecting 151. [Mono] GC_BRIDGE num-objects 2994 num_hash_entries 627215 sccs size 580329 init 0.00ms df1 1722.13ms sort 965.77ms dfs2 5635.07ms setup-cb 226.27ms free-data 1022.01ms user-cb 89.15ms clenanup 5.57ms links 4864549/4864549/194960546/140 dfs passes 8644474/8346351 [Mono] GC_MINOR: (Nursery full) pause 1766.82ms, total 1767.00ms, bridge 7947.04ms promoted 32K major 43360K los 1499K

I suspect, num_hash_entries value (627215) is too big.

However, if we call GC.Collect() manually it's always quick with something like the following in logs:
[Mono] GC_BRIDGE num-objects 197 num_hash_entries 287 sccs size 274 init 0.00ms df1 1.17ms sort 0.33ms dfs2 0.98ms setup-cb 0.16ms free-data 0.17ms user-cb 43.32ms clenanup 1.32ms links 5674306/5674306/218541494/140 dfs passes 10084180/9736021

Going through docs there's very few information about GC_BRIDGE, basically the following: num-objects is the number of bridge objects this pass is considering, and num_hash_entries is the number of objects processed during this invocation of the bridge code..

What is the bridge object (it's certainly something different from GREF's, and according to GREF logging, we have no problems with GREF)? What could increase the number of them to over 600K? Why during explicit GC.Collect() there's much less of them? Is there a way to cause this 'huge' GC_BRIDGE manually?

Posts

  • I noticed, that when I refresh (completely recreate all the data and reconstruct the Fragment), a certain screen in the application, huge GC_BRIDGE happens on every 16th refresh.

    However, if I do explicit GC.Collect() after 14th refresh, huge GC_BRIDGE is postponed (allowing another 16 refreshes). So, by doing GC.Collect() manually from time to time, I see almost no huge GC_BRIDGE's (and GC.Collect is way faster then huge GC_BRIDGE).

    Why is that?

  • YSharpYSharp USMember ✭✭
    edited January 2014

    It is difficult to do better than just guesstimate without knowing about more code context... Especially in the tricky swamps of garbage collection, and considering, to top it off, that two managed environments are involved (.NET + JVM).

    But you did provide a good clue that it's seemingly happening for a specific screen of your application, and for a specific operation (quoting) - "when I refresh (completely recreate all the data and reconstruct the Fragment), a certain screen in the application..."

    I agree that the # of hash entries that you see reported when you have the issue does seem suspect... 600K+ of objects to be tested (*) for reachability by the GC is quite a lot... (I mean, considering that it's only because of a single state transition from the application / user experience point of view).

    Now, I can think of at least one type of data structure in a managed environment that can undesirably proliferate that fast if some piece bit of our code is, say, "lousy" (or relies on some other which is)... and that's : strings.

    (immutable in both Java and .NET, as we know - hence, proliferating as fast as with every single string concatenation, etc)

    So, the best I can suggest at this point is for you to look closely (and try to model mentally, helped by some debugging if you can) about every single thing (runtime operation in local variables, fields, etc) that is implied by your "completely recreate all the data, etc" - in terms of string operations, if you can't suspect other objects.

    Wishing you good hunting (if I dare say)... :/

    'HTH !

    (*) and note: that would be a minimum actually, assuming that it's a perfect hash function being used internally (which is most unlikely) - so, the actual number of objects involved is certainly even larger by a factor of "who knows" (could be times 2, or times 3, or etc)

  • Cyril, big thanks for your response on topic!
    Please, correct me if I'm wrong: the num_hash_entries is related to the number of .net objects Mono GC is going through (so, basically it's close to what we'd see running this app on Windows), it's not related to inter-VM things (GREFs or something) or Java side, right?

    it's seemingly happening for a specific screen

    Unfortunately, it's happening through all the app, but on that specific screen I'm able to 100% reproduce it every 16 "refreshes" (exactly 16, so it helps to see whether there's progress on the problem or not).

    And one more thing: why calling GC.Collect() doesn't behave the same as this 'random' GC_BRIDGEs?

  • YSharpYSharp USMember ✭✭
    edited January 2014

    I'm like you, I'm pretty much only discovering the docs on the topic, besides my general knowledge on how GCs such as in .NET or the JVM work...

    Re-quoting the docs, it states clearly:

    In the GC_BRIDGE message, num-objects is the number of bridge objects this pass is considering, and num_hash_entries is the number of objects processed during this invocation of the bridge code.
    

    And then:

    In general, the larger the value of num_hash_entries, the more time that the bridge collections will take, and the larger the total time spent collecting will be.
    

    So, on this:

    "the num_hash_entries is related to the number of .net objects Mono GC is going through [...] not [...] the Java side, right?"

    ... yes, I would assume the same.

    But on that:

    "And one more thing: why calling GC.Collect() doesn't behave the same as this 'random' GC_BRIDGEs?"

    No useful idea, sorry :/

    I can only say that nothing tells us how exactly the "work load" is computed for that so-called "GC_BRIDGE" and how that compares in the general case to a full collect (by GC.Collect()) that you may trigger manually in your own code. What I mean is, yes, you may have found that in some cases, by choosing yourself the moment to involve the GC thru your app code, it does a better job than letting the system's default strategy "guess things out".

    Of course, that's not something we like to be forced to do (from a code design, maintenance, etc perspective).

    Again, try to look closely at / reconsider what your application is busy doing just before those suspect heavy GC_BRIDGE messages with many hash entries, to try find the pattern of object allocation (I proposed strings, but that could be something else) that might be the culprit.

    If you really spend too much time / run out of ideas of where your code could be "at fault", well, you can always try escalate that to tech support (who will certainly want you to provide the most minimal test case capable to reproduce the issue).

    Good luck! :/

  • ArturDrobinskiyArturDrobinskiy USMember
    edited January 2014

    I found, that GC.Collect() triggers GC_MAJOR with low GC_BRIDGE.

    However, GC.Collect(0) triggers GC_MINOR and that causes GC_BRIDGE with 600K entries.

    That quite puzzles me, because minor collections are meant to be cheap and collect newborn objects, and we definitely don't create 600K objects with 2 refreshes (GC.Collect(0) starts triggering GC_BRIDGE with 600K of num_hash_entries after 2 screen refreshes, and this number - 600K - stays the same if we GC.Collect(0) after 4/5/6... refreshes)

    However, it's possible, that we have total 600K strings/other objects in memory (we have a heavy cache), but we load them when the app starts, and they are not changed.

  • YSharpYSharp USMember ✭✭
    edited January 2014
    However, it's possible, that we have total 600K strings/other objects in memory (we have a heavy cache), but we load them when the app starts, and they are not changed.
    

    Then, to verify this theory, you can try this (even if that makes some of your app features unusable - it's just for the sake of verification, just once):

    try to make it so that you cut the data size in your applicative cache in half (or roughly so)

    Then, do it again to go as low as ~ 25% of data that you cache on app startup.

    If you see the figures in the GC_BRIDGE traces cut in about the same proportions, then you'll have a clue that your cache mechanism is unfriendly (for whatever reason, but that'll be an independent issue) with the standard stragegy of the GC for the versions that you use.

    Next action (for workaround, or bug report, or etc) will be at your choosing. But at least, you'll know.

    'HTH,

  • YSharpYSharp USMember ✭✭
    edited January 2014

    Btw, it's a general advice I cannot recommend warmly enough after 10+ years being busy coding against .NET, and 17+ years in software:

    we always better make it simple and working first, but as soon as our code goes significantly past being trivial (if only in # of SLOC), we also better start measuring early and often.

    Measure, measure, measure. Measure early... and measure often - for the CLR's sake !

    (I believe you already knew, but others may find it useful to know as well, especially with handheld devices)

  • ArturDrobinskiyArturDrobinskiy USMember
    edited January 2014

    try to make it so that you cut the data size in your applicative cache in half (or roughly so)

    You're right, after removing that cache completely and calling GC.Collect(0), we now have
    [Mono] GC_BRIDGE num-objects 613 num_hash_entries 17591

    That confirms the point, that GC_MINOR for some reason goes through all objects that we have in memory, and contrary, GC_MAJOR doesn't do so.
    Is it correct behaviour of GC? Are we the only one that have such a problem?

  • YSharpYSharp USMember ✭✭

    No idea if that's normal or not, but that does feel counter-intuitive, I agree... :/

Sign In or Register to comment.