Fabulate generates data from your specifications. It could be used for filling a test system with dummy customers, items and orders, or just for fun.
You can run some of the included samples using the lein shell script (from Leiningen, the project automation tool for Clojure).
The file named "items.fab" in the samples folder contains the following:
prototype item {
id /[A-Z]{2}\d{3}-\d{2}[a-z]/ # a value matching the specified regex
category <Home Office Garden Construction Tools> # a few possible choices
price price [10 1000] # a range of doubles, rounded as price
# a column may include the value of other columns
blurb format "Item %s is a nice %s product for only €%.2f" $id $category $price
}
You can then get Fabulate to generate items according to this specification. In the project directory run the command
lein run -i samples/items.fab csv
Fabulate will now generate a number of sample items for you in CSV format and output them to the console. For example:
id,category,price,blurb
HP500-68q,Office,844.5,"Item HP500-68q is a nice Office product for only €844,50"
WO299-02g,Office,188.3,"Item WO299-02g is a nice Office product for only €188,30"
OW828-76x,Construction,735.7,"Item OW828-76x is a nice Construction product for only €735,70"
TM275-55y,Tools,234.1,"Item TM275-55y is a nice Tools product for only €234,10"
XZ777-90i,Home,212.6,"Item XZ777-90i is a nice Home product for only €212,60"
CN335-82r,Home,670.0,"Item CN335-82r is a nice Home product for only €670,00"
IH006-11i,Garden,936.5,"Item IH006-11i is a nice Garden product for only €936,50"
GJ371-06k,Garden,360.5,"Item GJ371-06k is a nice Garden product for only €360,50"
UM487-82y,Home,731.6,"Item UM487-82y is a nice Home product for only €731,60"
UL110-77u,Tools,494.9,"Item UL110-77u is a nice Tools product for only €494,90"
If you'd rather write the CSV data to a file, you can do that as well with the -d (--destination) option; see the command line reference.
The Fabulate data generation DSL currently supports the following constructs:
# Anything on a line after the comment sign # is ignored until the end of the line.
A literal is a value representing itself. Simple words need not be quoted. Examples are:
nice # the string "nice"
"very nice" # wrap the string in quotes to include spaces
123.45 # the value 123.45
Angle brackets indicate possible choices.
<Red Green Blue> # one of these three colors will be chosen, equally likely.
Individual choices can also be weighted, if they are to be more or less likely.
<Red:3 Green:20 Blue:7> # one of these three colors will be chosen, according to their weight.
Adding these weights makes Green the most likely choice (probability 20/30, or about 67%), followed by Blue (7/30, or about 23%) and then Red (3/30, or 10%). Choices without weights are considered to have a weight of one.
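The arithmetic above can be checked with a small Python sketch. It only illustrates the weighting rule described here and is not Fabulate's implementation; the helper names parse_choices, probabilities and pick are made up for this example.

```python
import random
from fractions import Fraction

def parse_choices(spec):
    """Parse 'Red:3 Green:20 Blue' into (value, weight) pairs; a missing weight means 1."""
    pairs = []
    for token in spec.split():
        value, _, weight = token.partition(":")
        pairs.append((value, int(weight) if weight else 1))
    return pairs

def probabilities(pairs):
    """Each choice's probability is its weight divided by the sum of all weights."""
    total = sum(weight for _, weight in pairs)
    return {value: Fraction(weight, total) for value, weight in pairs}

def pick(pairs, rng=random):
    """Draw one value, respecting the weights."""
    values = [value for value, _ in pairs]
    weights = [weight for _, weight in pairs]
    return rng.choices(values, weights=weights, k=1)[0]

colors = parse_choices("Red:3 Green:20 Blue:7")
for value, p in probabilities(colors).items():
    print(value, p, f"{float(p):.1%}")
print(pick(colors))
```

With the weights 3, 20 and 7 the fractions reduce to 1/10, 2/3 and 7/30.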
Square brackets indicate ranges of possible numeric values.
[1 10] # yields any decimal value between 1 (inclusive) and 10 (exclusive), equally likely, such as 4.56362721652
Weights can also be attached to the range endpoints, in order to make values from one end of the range more likely than the other.
[1:2 10:8] # yields a decimal value between 1 (inclusive) and 10 (exclusive), but more likely towards the higher end.
The probability of getting a value near 1 is not 2/10, however. In order to understand the probabilities for weighted ranges, consider the area under the line segment (1, 2) - (10, 8), which can be seen as a 9 by 2 rectangle topped by a right-angled triangle with sides (10-1)=9 and (8-2)=6. Intuitively we see that a larger part of the area of such a shape is near x=10 than near x=1, and so Fabulate will be more likely to pick a value near 10.
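To put numbers on that intuition, the following Python sketch computes the area under the segment and the share of it that lies in each half of the range. It assumes that the probability of a sub-range is proportional to the area above it, as described above; the function names are illustrative only.

```python
def trapezoid_area(x0, w0, x1, w1):
    """Area under the straight line from (x0, w0) to (x1, w1)."""
    return (x1 - x0) * (w0 + w1) / 2

def weight_at(x, x0, w0, x1, w1):
    """Linearly interpolated weight at position x on the segment."""
    return w0 + (w1 - w0) * (x - x0) / (x1 - x0)

x0, w0, x1, w1 = 1, 2, 10, 8
# Rectangle 9 * 2 = 18 plus triangle 9 * 6 / 2 = 27.
total = trapezoid_area(x0, w0, x1, w1)
mid = (x0 + x1) / 2
lower = trapezoid_area(x0, w0, mid, weight_at(mid, x0, w0, x1, w1))
upper = total - lower

print("total area:", total)
print("share below the midpoint:", lower / total)
print("share above the midpoint:", upper / total)
```

Under this reading the total area is 45, and roughly 35% of it lies below the midpoint 5.5 while 65% lies above it.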
Once we allow weighted ranges, it makes sense to allow ranges with more than two steps.
[1:2 10:8 12:1]
This indicates two line segments, (1, 2) - (10, 8) and (10, 8) - (12, 1). The area under the second line segment (9) is much smaller than the area under the first (45), so a value is five times as likely to be chosen from the first part of the range as from the second. Within the second part, values are more likely to be picked near its start (10) than near its end (12), as the start carries a greater weight.
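One way to sample from such a multi-step range is sketched below in Python: pick a segment with probability proportional to its area, then pick a position inside the segment by inverting its cumulative distribution. This is an assumption about how the behaviour described above could be realised, not a description of Fabulate's code, and all function names are invented for the example.

```python
import math
import random

def segment_areas(points):
    """Area under each line segment of a list of (value, weight) points."""
    return [(x1 - x0) * (w0 + w1) / 2
            for (x0, w0), (x1, w1) in zip(points, points[1:])]

def sample_segment(x0, w0, x1, w1, u):
    """Map a uniform u in [0, 1) to a value between x0 and x1, following the linear weights."""
    if w0 == w1:
        t = u  # flat segment: plain uniform sampling
    else:
        # Solve (w1 - w0)/2 * t^2 + w0 * t = u * (w0 + w1)/2 for t in [0, 1].
        t = (math.sqrt(w0 * w0 + u * (w1 * w1 - w0 * w0)) - w0) / (w1 - w0)
    return x0 + t * (x1 - x0)

def sample_range(points, rng=random):
    """Choose a segment by area, then a position within that segment."""
    areas = segment_areas(points)
    i = rng.choices(range(len(areas)), weights=areas, k=1)[0]
    (x0, w0), (x1, w1) = points[i], points[i + 1]
    return sample_segment(x0, w0, x1, w1, rng.random())

points = [(1, 2), (10, 8), (12, 1)]
print(segment_areas(points))
print([round(sample_range(points), 2) for _ in range(5)])
```

For the range [1:2 10:8 12:1] the segment areas come out as 45 and 9, so about five in six samples should fall below 10.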
In Fabulate, regexes are not used to match text; instead they are used to generate data matching the regex, using the re-rand library. Regexes are written between slashes, like JavaScript regex literals.
/[A-Z]\d{3}/ # generates values like A637, H762, T013
/[A-F0-9]{8}(-[A-F0-9]{4}){3}-[A-F0-9]{12}/ # generates a GUID-like value
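The idea of generating rather than matching can be illustrated with a hand-written Python generator for the id pattern used in items.fab, /[A-Z]{2}\d{3}-\d{2}[a-z]/. This is only a sketch for that one pattern; it does not use or imitate the re-rand library, and random_item_id is a hypothetical name.

```python
import random
import re
import string

ID_PATTERN = re.compile(r"[A-Z]{2}\d{3}-\d{2}[a-z]")

def random_item_id(rng=random):
    """Build a string piece by piece so that it matches ID_PATTERN."""
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(2))
    first = "".join(rng.choice(string.digits) for _ in range(3))
    second = "".join(rng.choice(string.digits) for _ in range(2))
    suffix = rng.choice(string.ascii_lowercase)
    return f"{letters}{first}-{second}{suffix}"

item_id = random_item_id()
# Every generated value matches the regex it was built from.
print(item_id, bool(ID_PATTERN.fullmatch(item_id)))
```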
A prototype is much like a table definition; a specification of a collection of related fields. You are allowed to specify multiple prototypes in a single .fab-file, but data will only be generated for the first prototype at this time.
Fabulate also supports nested prototypes, such as in this example prototype named aParent, containing two child prototypes.
prototype aParent
{
num [0 10]
sum add $childA.val $childB.val
working or $done $started
childA {
val [0 10]
done bool
}
childB {
val [0 10]
started bool
}
}
The nesting may continue to arbitrary depths. Note however that, currently, only a single instance of any nested prototype can be generated. Adding array support for nested prototypes is a top development priority. Array support for single fields has been implemented, via the macro function repeat described later. However, the CSV writer might not currently be able to properly output arrays.
Field cross-references are symbols with a leading dollar sign, possibly with a multi-part, dot-separated name. A field xref can reference by name any field in the current prototype, as long as no reference cycles are created.
A reference can be made to sibling fields, from a parent prototype to a child field, or from a nested prototype to a parent field. In the above "aParent" prototype, a reference to val needs to be qualified with the containing prototype, such as $childA.val or $childB.val, or the reference will be ambiguous. However, $done and $started are adequate references regardless of where the reference is made from (childA, childB or the parent), as each uniquely identifies a single field.
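The ambiguity rule can be sketched in Python as a lookup over the fully qualified field paths of the aParent prototype: a reference resolves if exactly one path ends with the referenced name parts. This is a simplified assumption about the resolution rule, written only to illustrate the examples above; Fabulate's actual lookup may differ.

```python
# Fully qualified field paths of the aParent prototype, relative to aParent.
FIELDS = [
    "num", "sum", "working",
    "childA.val", "childA.done",
    "childB.val", "childB.started",
]

def resolve(reference, fields=FIELDS):
    """Return the single field path that ends with the referenced name parts."""
    wanted = reference.lstrip("$").split(".")
    matches = [f for f in fields if f.split(".")[-len(wanted):] == wanted]
    if not matches:
        raise LookupError(f"unknown field reference: {reference}")
    if len(matches) > 1:
        raise LookupError(f"ambiguous field reference {reference}: {matches}")
    return matches[0]

print(resolve("$done"))        # unique, so no qualification is needed
print(resolve("$childA.val"))  # qualified, so unambiguous
try:
    resolve("$val")
except LookupError as error:
    print(error)
```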
A field value can use functions to transform the generated data. The syntax is simply a function name followed by one or more parameters separated by spaces. Parameterless functions are not currently supported. A few examples are the use of the functions price and format below:
itemprice price [10 1000] # a range of doubles, rounded as price
text format "Price is %.2f" (price [10 50])
The field "itemprice" calls the "user-defined" function price, which is an ordinary function, written in Clojure, that does price rounding. These functions are not defined in the DSL, and at this point need to be defined manually in the namespace fabulate.dslfunctions.
The field "text" uses two functions: the clojure.core function "format" for string formatting, and the price function, applied to a value in the range 10-50. All arguments to a function are fully evaluated before the values are passed on to the function (compare Macro functions below).
In general, functions are automatically picked up from the namespaces fabulate.dslfunctions and clojure.core.
Macro functions are used just like ordinary functions in the DSL, but their implementation is quite different, as they hook into Fabulate's data generation engine and can introspect on their own definition, including its parameters.
DSL-macros are implemented as ordinary Clojure functions in the dslfunctions namespace, but are annotated with {:fabmacro true}. They resemble Clojure or Lisp macros in several ways, but are not the same thing. The arguments for an ordinary DSL-function are fully evaluated before the function is called, but arguments for a DSL-macro function are not evaluated beforehand. Instead the macro function receives its own definition, including the definition of all parameters (unevaluated), and is thus able to introspect on that definition.
A macro function is then responsible for using its definition to generate a value, likely using the definition of its parameters as input.
As an example, repeat is a macro function that generates a number of values by repeatedly using the definition of its last parameter to generate a new value. The first parameter is used once to generate a value, which controls the number of repetitions.
For example:
repeat [2 5] <alfa beta gamma delta>
has two parameters. The first is a repeat count, and the second defines what will be repeated. In this case "repeat" will evaluate the count parameter once, yielding the number of repetitions, here a number in the 2-5 range, say 3. Then "repeat" will evaluate the second parameter, a multiple choice construct, 3 times, yielding say "alfa", "gamma" and "gamma", and the resulting list of values will be the return value of the repeat function.
Had repeat been an ordinary function, both parameters would have been evaluated once before repeat was invoked, and it would only be possible to keep repeating the same value.
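The difference can be sketched in Python by passing either finished values or zero-argument callables that stand in for unevaluated definitions. This is an analogy only: Fabulate's macros receive their DSL definition rather than Python callables, and the assumption that the count is a whole number from 2 to 5 is made here purely for illustration.

```python
import random

def repeat_function(count, value):
    """Ordinary function: both arguments arrive already evaluated."""
    return [value] * count

def repeat_macro(count_def, value_def):
    """Macro-style: arguments arrive unevaluated (here as zero-argument callables)."""
    count = count_def()  # evaluated once
    return [value_def() for _ in range(count)]  # evaluated anew for every element

rng = random.Random()
count_def = lambda: rng.randint(2, 5)  # stands in for [2 5]
value_def = lambda: rng.choice(["alfa", "beta", "gamma", "delta"])  # stands in for <alfa beta gamma delta>

# Eager evaluation can only repeat the one value it was handed.
print(repeat_function(count_def(), value_def()))
# Deferred evaluation draws a fresh value for each repetition.
print(repeat_macro(count_def, value_def))
```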
Pipelines are a way to improve the readability of complex field definitions, by reducing nesting and keeping the apparent flow of information left-to-right instead of right-to-left.
A pipeline rewrites definitions of the form
inner | outer
to outer inner
and can be nested arbitrarily, so
a | b | c | d
is rewritten to d (c (b a)).
The inner expression is added as the last parameter to the outer function, so adding parameters to our example:
a p1 p2 | b p3 p4 | c p5 | d p6
would be rewritten as
d p6 (c p5 (b p3 p4 (a p1 p2)))
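The rewrite rule can be expressed as a simple left-to-right fold. The Python sketch below works on pre-tokenised stages (each stage is a list of tokens, with a range such as [10 1000] counted as one token) and is meant only to illustrate the rule; it is not how Fabulate's parser is implemented, and the function names are invented.

```python
def unwrap(expr):
    """A single-token expression needs no parentheses."""
    return expr[0] if len(expr) == 1 else expr

def rewrite_pipeline(stages):
    """Append each accumulated expression as the last parameter of the next stage."""
    expr = stages[0]
    for stage in stages[1:]:
        expr = stage + [unwrap(expr)]
    return expr

def render(expr):
    """Print nested token lists using parentheses for nested calls."""
    return " ".join(f"({render(p)})" if isinstance(p, list) else p for p in expr)

simple = [["a"], ["b"], ["c"], ["d"]]
with_params = [["a", "p1", "p2"], ["b", "p3", "p4"], ["c", "p5"], ["d", "p6"]]
print(render(rewrite_pipeline(simple)))
print(render(rewrite_pipeline(with_params)))
```

The two calls reproduce the rewritten forms given above for a | b | c | d and for the variant with parameters.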
Nesting pipelines in parentheses is also supported, so a (b | c) d is allowed, and will be rewritten into a (c b) d.
Therefore the following field definitions are all equivalent:
sort (repeat 5 (price [10 1000]))
sort (repeat 5 ([10 1000] | price))
sort (price [10 1000] | repeat 5)
sort ([10 1000] | price | repeat 5)
repeat 5 (price [10 1000]) | sort
repeat 5 ([10 1000] | price) | sort
price [10 1000] | repeat 5 | sort
[10 1000] | price | repeat 5 | sort
and are all internally rewritten into the first form. However, [10 1000] | price | sort (repeat 5) is not an equivalent definition, as this would end up calling repeat with only one parameter, and sort with two.
The basic syntax is:
lein run <generic-params> <writer> <writer-params>
An example would be
lein run -i samples/items.fab csv --separator=/
There are a few generic parameters valid for all writers
-n --count ROWS Number of rows to generate
-i --input FILE Input file with fab column specifications
-s --select FIELDS Comma separated list of fields to include in the output (default: all)
-d --destination FILE Destination file (defaults to outputting data to the console)
In addition, each writer has its own set of parameters. At this point csv and json are the only available writers.
Be warned that error reporting from the command line parser is basically nonexistent. Also, when specifying an unknown writer name, you get a long error message with a stack trace, and the program exits. This will improve at some point in the future.
A writer is a component that knows how to output the generated data in a specific way, such as a specific file format, or connecting to a specific database.
The JSON writer follows the specification on http://json.org/ (using the clojure.data.json library.)
The JSON writer does not have any additional parameters at this time.
lein run -i samples/items.fab json
The CSV writer formats data according to the RFC4180 specification (using the clojure.data.csv library).
The CSV writer has the following additional parameters, which need to go after the name of the writer.
-s --separator CHAR Field separator to use
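The effect of the separator on quoting can be illustrated with Python's standard csv module, which follows the same RFC 4180 conventions. This is a stand-in for illustration only; Fabulate itself uses clojure.data.csv, and to_csv is a hypothetical helper.

```python
import csv
import io

def to_csv(rows, separator=","):
    """Write rows as CSV text, quoting only the fields that need it."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=separator, lineterminator="\r\n")
    writer.writerows(rows)
    return buffer.getvalue()

rows = [
    ["id", "price", "blurb"],
    ["HP500-68q", "844.5", "only €844,50"],
]

# With a comma separator, the blurb must be quoted because it contains a comma.
print(to_csv(rows))
# With a semicolon separator, the same field needs no quotes.
print(to_csv(rows, separator=";"))
```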
If you cannot get the field separator to work as you expect, be aware that your shell treats some characters specially, such as semicolon, question mark, asterisk or backslash. Depending on your shell, these may need to be escaped or quoted in order to become a valid parameter value.
Note that at the present time the short argument name "-s" is used both as a generic argument (for selecting particular fields to output) and in the CSV writer to indicate the desired separator character. This works only because you cannot give generic arguments after the writer name argument, or writer-specific arguments before the writer name, so specifying
lein run -i samples/items.fab -s id,price csv -s:
may indeed be confusing to the user, but is perfectly unambiguous to the command line parser.
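Why the two uses of -s do not clash can be seen from a sketch that splits the argument list at the writer name. The Python below is an assumption about the general approach, not the actual command line parser; note that in this naive version an option value that happens to equal a writer name would be mistaken for the writer.

```python
WRITERS = {"csv", "json"}

def split_args(argv):
    """Split arguments into generic args, writer name and writer-specific args."""
    for i, arg in enumerate(argv):
        if arg in WRITERS:
            return argv[:i], arg, argv[i + 1:]
    raise ValueError("no writer name found")

argv = ["-i", "samples/items.fab", "-s", "id,price", "csv", "-s:"]
generic, writer, writer_args = split_args(argv)
print(generic)      # the -s here selects fields
print(writer)
print(writer_args)  # the -s here sets the separator
```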
As a reminder, when developing you can also try:
lein help
lein kibit              run static code analysis
lein ancient            check for newer versions of project dependencies
lein midje              run tests (once)
lein midje :autotest    run tests (on file change)
You'll need to install the relevant lein plugins, of course.
Fabulate's parser is built with the excellent instaparse parsing library.
Copyright © 2013, 2014 Martin Hellspong
Distributed under the Eclipse Public License, the same as Clojure.