Describe the bug Not sure if it's a bug or feature. But after upg

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Memory leak in maps containing FakerContext as a key,about datafaker-net/datafaker

Comments (30)

asolntsev commented on July 30, 2024 1

@bodiam @snuyanzin I don't know the background, but static fields in FakeValuesService seem strange to me, especially REGEXP2SUPPLIER_MAP.

They use RegExpContext as a key, and it's quite unclear when two RegExpContext instances are equal or not.
Also, nobody ever deletes unused records from REGEXP2SUPPLIER_MAP - it also seems like a potential cause of OutOfMemory errors.

Maybe these fields in FakeValuesService should be non-static?

from datafaker.

asolntsev commented on July 30, 2024 1

@akefirad Now I see you are mixing two tools solving the same problem. DF also provides test repeatability etc.
But yes, I agree that DF should work properly in your demo application as well.

from datafaker.

snuyanzin commented on July 30, 2024 1

REGEXP2SUPPLIER_MAP

thanks for looking, it might introduce some impact however seems to be not the root cause.

I played a bit with this demo and if I increase amount of tries to let's say 100k then there is OOM.
Checks across the versions shows that for 1.6.0 there is no OOM and for all the version starting 1.7.0 there is OOM

So there was some change introduced such behavior between 1.6.0 and 1.7.0... and it was not REGEXP2SUPPLIER_MAPsince it appeared much later

Feel free to go deeper, i'm not sure I will have time before the weekend to look at it

from datafaker.

snuyanzin commented on July 30, 2024 1

i guess the reason is support of .doWith functionality which needs this kind of cache
i'm going to play with it on weekends to see what can be done here

At the same time giving the fact that this issue was raised only after almost 2 years it was merged I wouldn't say it is a blocker

from datafaker.

vitaly-ivanov commented on July 30, 2024 1

remove randomservice from cache
1.2. Since this is also a tricky step to have it done in one go then first make it forCLASSES in net.datafaker.providers.base.BaseFaker
1.3. then for PROVIDERS_MAP.
Here it would make sense to note that together with decreasing amount of elements in maps now there will be one element per locale. It will lead to flaky tests. The flakyness comes as a result of ordering execution since providers have a link to randomservice and each time there will be invoked nextInt.
The simplest way to fix flaky tests would be disabling of parallel tests on junit level

If I understand the proposed solution correctly, subsequent calls to doWith(theSameSeed) will lead to subsequent calls to Random.next() methods and end up with different results. This looks like a breaking change and not an obvious behaviour in general.

If currently the subsequent calls to doWith(seed) don't use the cache because the FakerContext is always different, is the cache needed at all? Maybe it would be enough not to add the new values to the submap of PROVIDERS_MAP when using doWith(seed).

I don't have knowledge of the codebase and all the use cases, but this is an important case for us and it will be great if the current functionality is not broken.

from datafaker.

asolntsev commented on July 30, 2024 1

@snuyanzin I think I've found a solution. Will submit a pull request today later.

from datafaker.

snuyanzin commented on July 30, 2024

Have you compared profiler output for both cases?
What are the peaks?

from datafaker.

snuyanzin commented on July 30, 2024

In the long run (maybe as feature request?) can DF provide a better way to natively integrate with jqwik?

here there are volunteer people who earn nothing for DF support, so there is no road maps or so.
However if you want to improve something, then you're welcome to contribute otherwise wait when someone from contributors is also interested in solution for your issue and will solve it

from datafaker.

akefirad commented on July 30, 2024

here there are volunteer people who earn nothing for DF support, so there is no road maps or so.

Sure, I understand how OSS works. If it sounded demanding I didn't mean it that way. There's a feature-request type while filing an issue, that's why I was starting the conversation. Happy to help with anything.

Have you compared profiler output for both cases? What are the peaks?

The details are in the issue I mentioned above. Here's a summary of the discussion:

In version 2.2.2 90% of the tests time is spent in DF's net.datafaker.internal.helper.COWMap.put(..)...
It has to do with the random object handed into the Faker constructor.
In version 2 of DF this random object will be used to calculate a hash value (through some detour of RandomService) for the COWMap above. If a new instance of random is handed in in each generation step the COWMap will grow and get quadratically slower. This is why adding/removing injectNull(..) makes such a difference.
In DF version 1 objects were not stored in COWMap so this did not happen.

from datafaker.

snuyanzin commented on July 30, 2024

It has to do with the random object handed into the Faker constructor.

how may Faker objects are created during this demo?

from datafaker.

akefirad commented on July 30, 2024

I’m not sure about the exact number, but many, thousands if not more.

from datafaker.

snuyanzin commented on July 30, 2024

is there a reason? Each object has its own COWMap and since all of them are used for the same locale and same providers in your case it is the main reason.

I would first rewrite the code in a more optimal way to have just one Faker object which by itself each time will generate new data unless there is a good reason to have each time new object

from datafaker.

snuyanzin commented on July 30, 2024

Also demo in java would be much more helpful

from datafaker.

akefirad commented on July 30, 2024

Also demo in java would be much more helpful

In general or do you need it for this issue?

from datafaker.

snuyanzin commented on July 30, 2024

for this issue

from datafaker.

akefirad commented on July 30, 2024

There you go!
demo-java-jqwik-datafaker-v2.zip

from datafaker.

akefirad commented on July 30, 2024

is there a reason? Each object has its own COWMap and since all of them are used for the same locale and same providers in your case it is the main reason.
I would first rewrite the code in a more optimal way to have just one Faker object which by itself each time will generate new data unless there is a good reason to have each time new object

That's how jqwik works. There's not much one can do. This is not my area of expertise, so not sure all the details, but the summary is that jqwik has to generate many random objects to do its job and it is not possible to have a single Faker object, otherwise the test won't be deterministic. (This has been discussed in the other issue.)

from datafaker.

snuyanzin commented on July 30, 2024

This looks very strange: one lib generating a random generator for another library generating a random value...

What is the use case or final goal which is intended to be solved with this approach?

from datafaker.

akefirad commented on July 30, 2024

So we have a lot of tests using PBT technique, provided by jqwik. There’s a lot of pieces of test data that we could/should use jqwik to generate, e.g. numbers, timestamps, etc. However for some of our domain models we need some pieces that jqwik cannot provide data (well it can, but it’s very ugly because it’s pure random bits), e.g. customer name and email. That’s where DataFaker shines, (jqwik provides more than just random data, it’s a PBT framework).
And the way to glue the two (jqwik and DataFaker), at least as of now, is what I’ve done which seems to be totally impractical in V2.
Hope that clarifies.

from datafaker.

snuyanzin commented on July 30, 2024

Here it is the code to use collection functionality of DF which generates 10k TestCustomer in less than 1 sec

@Property(tries = 1)
    void testCustomer(@ForAll("customers") TestCustomer customer) {
        var faker = new Faker();
        System.out.println(faker
                .collection(() ->
                        new TestCustomer(faker.internet().emailAddress(), faker.name().fullName(),
                        faker.phoneNumber().phoneNumber(),
                        new TestCustomer.Address(faker.address().fullAddress()))).len(10_000).build().get());
    }

I think you need to figure out how to deal with these 2 libs in your project

from datafaker.

bodiam commented on July 30, 2024

Just curious, but which exact versions of Datafaker have you tried?

from datafaker.

akefirad commented on July 30, 2024

I'm using v2.2.0, version 2.1.0 seems to be no different. But version 2.0.0 seems to be worse in orders of magnitude (at in the sample project).

from datafaker.

asolntsev commented on July 30, 2024

@akefirad You can speed up your tests very easily. Just use a single instance of Faker instead of creating multiple instances:

class FakerArb {
    private static final Faker faker = new Faker();

    static Arbitrary<Faker> faker() {
        return Arbitraries.randomValue(random -> faker).dontShrink().withoutEdgeCases();
    }

from datafaker.

akefirad commented on July 30, 2024

@asolntsev, right, but unfortunately that doesn't play well with PBT; the above solution eliminates the repeatability of the tests. Another solution would be to hard code the random see for DF, but then that hurts the randomness of the tests. Another solution would be to use a single instance of Faker but generated by jqwik for the first time and used during the whole testing. But that has some other issues (multithreading, as well as low randomness and low repeatability). In short, in PBT, randomness should be managed by the platform/framework (in this case jqwik) not other libraries.

from datafaker.

akefirad commented on July 30, 2024

two tools solving the same problem.

Not really. The objective is to do PBT so it’s jqwik (DF is a random data generator library.) I’m trying to use DF in my test where it shines; generating seemingly real test data. So it’s more like DF is a complementary library for jqwik.

from datafaker.

asolntsev commented on July 30, 2024

@snuyanzin Thanks, it makes the scope narrower.
I debugged with DF 1.7.0, and the OOM is caused by BaseFaker.PROVIDERS_MAP.
It contains very many instances of FakerContext (which contain same locale, but different instances of RandomService).
At least, I see a problem in the fact that RandomService is used as a Map key, but doesn't implement equals method.

NB! @snuyanzin It seems to me that this caching (is it caching?) like PROVIDERS_MAP and REGEXP2SUPPLIER_MAP is not needed at all. What exactly goes wrong if we just remove these maps? Reading local files is fast; what exactly do we want to optimize? Maybe it's enough to just cache Pattern instances?

from datafaker.

bodiam commented on July 30, 2024

@asolntsev What exactly are you proposing? To stop caching and read files multiple times? Unless I'm misunderstanding what you're are saying, that doesn't seem like a great idea to me. Reading the files is quite slow (say, 1000-10.000 times slower than reading from memory).

I've disabled this cache just to see the performance difference, and issue759Test would take 5.6 seconds, instead of 500-800ms, and makes the total test take around 40 seconds. So, while I'm not sure if removing this cache really has an effect on reading the YMLs (I think it doesn't), it does affect the getting the methods using reflection, and the instantiation of the provider classes. But I think the files are only read once the values are used, not when providers are instantiated.

from datafaker.

snuyanzin commented on July 30, 2024

NB! @snuyanzin It seems to me that this caching (is it caching?) like PROVIDERS_MAP and REGEXP2SUPPLIER_MAP is not needed at all. What exactly goes wrong if we just remove these maps? Reading local files is fast; what exactly do we want to optimize? Maybe it's enough to just cache Pattern instances?

@asolntsev If we are talking about something like application for tests then usually it is fast enough even without caching
However if we try to use it for prod like data generation like described at DiUS/java-faker#663
then without caching it is super slow...

from datafaker.

asolntsev commented on July 30, 2024

However if we try to use it for prod like data generation like described at #663 then without caching it is super slow...

Thanks, it answers my question.
Then yes, probably caching is needed.

BUT then it should be fixed. At least, please consider these two comments: #1263 (comment) and #1263 (comment). We definitely have a problem with caching.

For me, it's very hard to understand what exactly is being cached, what serves as a key. :(

from datafaker.

snuyanzin commented on July 30, 2024

yes, thanks for the info it was helpful
not sure i'm able to submit everything within one iteration.

Here are some findings (mostly for myself to not lost the track)
In theory this should lead to the fixed behavior

At the same side the downside of the approach is that it will require to remove doWith for specified seed values functionality... So it is a breaking change from this point of view however same result still could be reached with creation a new Faker object and invoking a corresponding methods.

At least the plan is

remove randomservice from cache
1.2. Since this is also a tricky step to have it done in one go then first make it forCLASSES in net.datafaker.providers.base.BaseFaker
1.3. then for PROVIDERS_MAP.
Here it would make sense to note that together with decreasing amount of elements in maps now there will be one element per locale. It will lead to flaky tests. The flakyness comes as a result of ordering execution since providers have a link to randomservice and each time there will be invoked nextInt.
The simplest way to fix flaky tests would be disabling of parallel tests on junit level
Another issue is that FakeValuesServices#resExp during generation of suppliers uses both methods and providers. Providers still have link to locale via context meaning that this could lead to invikation of same suppliers for wrong locales... Probably this should be changed with functions or anything else making it possible to execute methods for providers with different locales.
(It was not an issue before because of multiple copies of objects where each copy was executed only against its own locale)

from datafaker.

Memory leak in maps containing FakerContext as a key about datafaker HOT 30 CLOSED

Comments (30)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent