Comments (30)
@bodiam @snuyanzin I don't know the background, but static fields in FakeValuesService
seem strange to me, especially REGEXP2SUPPLIER_MAP
.
- They use
RegExpContext
as a key, and it's quite unclear when twoRegExpContext
instances are equal or not. - Also, nobody ever deletes unused records from
REGEXP2SUPPLIER_MAP
- it also seems like a potential cause of OutOfMemory errors.
Maybe these fields in FakeValuesService
should be non-static?
from datafaker.
@akefirad Now I see you are mixing two tools solving the same problem. DF also provides test repeatability etc.
But yes, I agree that DF should work properly in your demo application as well.
from datafaker.
REGEXP2SUPPLIER_MAP
thanks for looking, it might introduce some impact however seems to be not the root cause.
I played a bit with this demo and if I increase amount of tries to let's say 100k then there is OOM.
Checks across the versions shows that for 1.6.0 there is no OOM and for all the version starting 1.7.0 there is OOM
So there was some change introduced such behavior between 1.6.0 and 1.7.0... and it was not REGEXP2SUPPLIER_MAP
since it appeared much later
Feel free to go deeper, i'm not sure I will have time before the weekend to look at it
from datafaker.
i guess the reason is support of .doWith
functionality which needs this kind of cache
i'm going to play with it on weekends to see what can be done here
At the same time giving the fact that this issue was raised only after almost 2 years it was merged I wouldn't say it is a blocker
from datafaker.
remove randomservice from cache
1.2. Since this is also a tricky step to have it done in one go then first make it forCLASSES in net.datafaker.providers.base.BaseFaker
1.3. then for PROVIDERS_MAP.
Here it would make sense to note that together with decreasing amount of elements in maps now there will be one element per locale. It will lead to flaky tests. The flakyness comes as a result of ordering execution since providers have a link to randomservice and each time there will be invoked nextInt.
The simplest way to fix flaky tests would be disabling of parallel tests on junit level
If I understand the proposed solution correctly, subsequent calls to doWith(theSameSeed)
will lead to subsequent calls to Random.next()
methods and end up with different results. This looks like a breaking change and not an obvious behaviour in general.
If currently the subsequent calls to doWith(seed)
don't use the cache because the FakerContext
is always different, is the cache needed at all? Maybe it would be enough not to add the new values to the submap of PROVIDERS_MAP
when using doWith(seed)
.
I don't have knowledge of the codebase and all the use cases, but this is an important case for us and it will be great if the current functionality is not broken.
from datafaker.
@snuyanzin I think I've found a solution. Will submit a pull request today later.
from datafaker.
Have you compared profiler output for both cases?
What are the peaks?
from datafaker.
In the long run (maybe as feature request?) can DF provide a better way to natively integrate with jqwik?
here there are volunteer people who earn nothing for DF support, so there is no road maps or so.
However if you want to improve something, then you're welcome to contribute otherwise wait when someone from contributors is also interested in solution for your issue and will solve it
from datafaker.
here there are volunteer people who earn nothing for DF support, so there is no road maps or so.
Sure, I understand how OSS works. If it sounded demanding I didn't mean it that way. There's a feature-request type while filing an issue, that's why I was starting the conversation. Happy to help with anything.
Have you compared profiler output for both cases? What are the peaks?
The details are in the issue I mentioned above. Here's a summary of the discussion:
In version 2.2.2 90% of the tests time is spent in DF's net.datafaker.internal.helper.COWMap.put(..)...
It has to do with the random object handed into the Faker constructor.
In version 2 of DF this random object will be used to calculate a hash value (through some detour of RandomService) for the COWMap above. If a new instance of random is handed in in each generation step the COWMap will grow and get quadratically slower. This is why adding/removing injectNull(..) makes such a difference.
In DF version 1 objects were not stored in COWMap so this did not happen.
from datafaker.
It has to do with the random object handed into the Faker constructor.
how may Faker objects are created during this demo?
from datafaker.
I’m not sure about the exact number, but many, thousands if not more.
from datafaker.
is there a reason? Each object has its own COWMap and since all of them are used for the same locale and same providers in your case it is the main reason.
I would first rewrite the code in a more optimal way to have just one Faker object which by itself each time will generate new data unless there is a good reason to have each time new object
from datafaker.
Also demo in java would be much more helpful
from datafaker.
Also demo in java would be much more helpful
In general or do you need it for this issue?
from datafaker.
for this issue
from datafaker.
There you go!
demo-java-jqwik-datafaker-v2.zip
from datafaker.
is there a reason? Each object has its own COWMap and since all of them are used for the same locale and same providers in your case it is the main reason.
I would first rewrite the code in a more optimal way to have just one Faker object which by itself each time will generate new data unless there is a good reason to have each time new object
That's how jqwik works. There's not much one can do. This is not my area of expertise, so not sure all the details, but the summary is that jqwik has to generate many random objects to do its job and it is not possible to have a single Faker object, otherwise the test won't be deterministic. (This has been discussed in the other issue.)
from datafaker.
This looks very strange: one lib generating a random generator for another library generating a random value...
What is the use case or final goal which is intended to be solved with this approach?
from datafaker.
So we have a lot of tests using PBT technique, provided by jqwik. There’s a lot of pieces of test data that we could/should use jqwik to generate, e.g. numbers, timestamps, etc. However for some of our domain models we need some pieces that jqwik cannot provide data (well it can, but it’s very ugly because it’s pure random bits), e.g. customer name and email. That’s where DataFaker shines, (jqwik provides more than just random data, it’s a PBT framework).
And the way to glue the two (jqwik and DataFaker), at least as of now, is what I’ve done which seems to be totally impractical in V2.
Hope that clarifies.
from datafaker.
Here it is the code to use collection functionality of DF which generates 10k TestCustomer in less than 1 sec
@Property(tries = 1)
void testCustomer(@ForAll("customers") TestCustomer customer) {
var faker = new Faker();
System.out.println(faker
.collection(() ->
new TestCustomer(faker.internet().emailAddress(), faker.name().fullName(),
faker.phoneNumber().phoneNumber(),
new TestCustomer.Address(faker.address().fullAddress()))).len(10_000).build().get());
}
I think you need to figure out how to deal with these 2 libs in your project
from datafaker.
Just curious, but which exact versions of Datafaker have you tried?
from datafaker.
I'm using v2.2.0, version 2.1.0 seems to be no different. But version 2.0.0 seems to be worse in orders of magnitude (at in the sample project).
from datafaker.
@akefirad You can speed up your tests very easily. Just use a single instance of Faker
instead of creating multiple instances:
class FakerArb {
private static final Faker faker = new Faker();
static Arbitrary<Faker> faker() {
return Arbitraries.randomValue(random -> faker).dontShrink().withoutEdgeCases();
}
from datafaker.
@asolntsev, right, but unfortunately that doesn't play well with PBT; the above solution eliminates the repeatability of the tests. Another solution would be to hard code the random see for DF, but then that hurts the randomness of the tests. Another solution would be to use a single instance of Faker but generated by jqwik for the first time and used during the whole testing. But that has some other issues (multithreading, as well as low randomness and low repeatability). In short, in PBT, randomness should be managed by the platform/framework (in this case jqwik) not other libraries.
from datafaker.
two tools solving the same problem.
Not really. The objective is to do PBT so it’s jqwik (DF is a random data generator library.) I’m trying to use DF in my test where it shines; generating seemingly real test data. So it’s more like DF is a complementary library for jqwik.
from datafaker.
@snuyanzin Thanks, it makes the scope narrower.
I debugged with DF 1.7.0, and the OOM is caused by BaseFaker.PROVIDERS_MAP
.
It contains very many instances of FakerContext
(which contain same locale, but different instances of RandomService
).
At least, I see a problem in the fact that RandomService
is used as a Map key, but doesn't implement equals
method.
![image](https://private-user-images.githubusercontent.com/279773/341170832-dda8912e-bf43-4be0-a0b8-b9e9ecd92755.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjIzMjMwODEsIm5iZiI6MTcyMjMyMjc4MSwicGF0aCI6Ii8yNzk3NzMvMzQxMTcwODMyLWRkYTg5MTJlLWJmNDMtNGJlMC1hMGI4LWI5ZTllY2Q5Mjc1NS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzMwJTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDczMFQwNjU5NDFaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT02ZDdmZWU4MDhlMzI3MDgyMzg3NDRhMGZlZDEwZDFjMTA3M2Y4ZGI2YmM1ZGNjMjc5NDVkMTVjMzg4NDZhZGRiJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.E3F_czTIR_cOs-0C2BmXxXAwZ9C2UkCDIPReA-DFuRU)
NB! @snuyanzin It seems to me that this caching (is it caching?) like PROVIDERS_MAP
and REGEXP2SUPPLIER_MAP
is not needed at all. What exactly goes wrong if we just remove these maps? Reading local files is fast; what exactly do we want to optimize? Maybe it's enough to just cache Pattern
instances?
from datafaker.
@asolntsev What exactly are you proposing? To stop caching and read files multiple times? Unless I'm misunderstanding what you're are saying, that doesn't seem like a great idea to me. Reading the files is quite slow (say, 1000-10.000 times slower than reading from memory).
I've disabled this cache just to see the performance difference, and issue759Test
would take 5.6 seconds, instead of 500-800ms, and makes the total test take around 40 seconds. So, while I'm not sure if removing this cache really has an effect on reading the YMLs (I think it doesn't), it does affect the getting the methods using reflection, and the instantiation of the provider classes. But I think the files are only read once the values are used, not when providers are instantiated.
from datafaker.
NB! @snuyanzin It seems to me that this caching (is it caching?) like PROVIDERS_MAP and REGEXP2SUPPLIER_MAP is not needed at all. What exactly goes wrong if we just remove these maps? Reading local files is fast; what exactly do we want to optimize? Maybe it's enough to just cache Pattern instances?
@asolntsev If we are talking about something like application for tests then usually it is fast enough even without caching
However if we try to use it for prod like data generation like described at DiUS/java-faker#663
then without caching it is super slow...
from datafaker.
However if we try to use it for prod like data generation like described at #663 then without caching it is super slow...
Thanks, it answers my question.
Then yes, probably caching is needed.
BUT then it should be fixed. At least, please consider these two comments: #1263 (comment) and #1263 (comment). We definitely have a problem with caching.
For me, it's very hard to understand what exactly is being cached, what serves as a key. :(
from datafaker.
yes, thanks for the info it was helpful
not sure i'm able to submit everything within one iteration.
Here are some findings (mostly for myself to not lost the track)
In theory this should lead to the fixed behavior
At the same side the downside of the approach is that it will require to remove doWith
for specified seed values functionality... So it is a breaking change from this point of view however same result still could be reached with creation a new Faker object and invoking a corresponding methods.
At least the plan is
-
remove
randomservice
from cache
1.2. Since this is also a tricky step to have it done in one go then first make it forCLASSES
innet.datafaker.providers.base.BaseFaker
1.3. then forPROVIDERS_MAP
.
Here it would make sense to note that together with decreasing amount of elements in maps now there will be one element per locale. It will lead to flaky tests. The flakyness comes as a result of ordering execution since providers have a link to randomservice and each time there will be invokednextInt
.
The simplest way to fix flaky tests would be disabling of parallel tests on junit level -
Another issue is that
FakeValuesServices#resExp
during generation of suppliers uses both methods and providers. Providers still have link to locale via context meaning that this could lead to invikation of same suppliers for wrong locales... Probably this should be changed with functions or anything else making it possible to execute methods for providers with different locales.
(It was not an issue before because of multiple copies of objects where each copy was executed only against its own locale)
from datafaker.
Related Issues (20)
- Breaking DateAndTime HOT 4
- Intermitted expressions failures during testing HOT 8
- Regexify method doesn't recognise "^" and "$" chars in datafaker before 2.2.0. HOT 7
- Remove public methods from `IdNumber` that are not intended to be directly called HOT 8
- How to specify country-specific setting, not language-specific? HOT 2
- Feature request: add support to generate a random PNG/SVG. HOT 1
- 2.3.0 Release HOT 13
- Password generation does not always contain a lowercase letter HOT 3
- Publishing docs works now HOT 1
- Memory leak for seeded random use case HOT 3
- Create a dictionary of obscene words HOT 8
- java.lang.ClassCastException for finance.credit_card provider with Enum argument in Faker expression HOT 5
- Problems with module-info.java with version 2.3.0 HOT 33
- New Project to improving tests HOT 9
- Locale specific generation performance degradation HOT 1
- Fix publishing of snapshots HOT 4
- Code coverage report is broken HOT 4
- Unable to resolve expression during build HOT 12
- Add weighted selection HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from datafaker.