ckrause / loda Goto Github PK
View Code? Open in Web Editor NEWLODA is an assembly language, a computational model and a tool for mining integer sequence programs.
License: Apache License 2.0
LODA is an assembly language, a computational model and a tool for mining integer sequence programs.
License: Apache License 2.0
We currently use 64-bit signed ints as number format. Many of the sequence terms in the OEIS exceed this size. We should check if there are libraries for big number formats or if it is feasible to implement it ourselves.
Currently $1
is the output register.
It will affect all programs.
Multiple outputs: What if someday a second output value is to be returned (rational number, complex number, coordinates). Should it then be placed in $2
. Or some other day, what if 3 output values are to be returned (3d vector).
Multiple inputs: It would be nice with programs that takes multiple inputs such as bitwise operators. So if the input goes into $0, $1
. Then it it's awkward that the output is in the $1
register.
I think $0
for output, is consistent and future proof.
The LODA repo is currently hosted in my personal account (ckrause). If the project should get more contributors, it would make sense to move it to a separate project space on GitHub. Unfortunately github.com/loda
is already taken, but we certainly could find an alternative (maybe change the name as well?).
If you're interested in becoming a regular contributor to the project and you think it makes sense to move it to a separate space, please add comments to this issue.
All generated CSV files in the statistics should have headers:
https://github.com/ckrause/loda/tree/master/stats
It would be much more consumer-friendly if we had prebuilt binaries for different platforms available. We would need to think about how to deliver the programs in this case though.
This will break with the current 3 letter instruction convention. Lots of programs will have to be updated.
For future LODA people it may be more obvious that it's related to OEIS. It wasn't intuitive for me.
It would be great to know more about which program call which other programs. We should generate a new csv file as part of our stats that has two columns: first column is the id of the caller, second the id of the callee. In addition we could generate a DAG visualization using graphviz.
Here a proposal:
Stats
class with a std::multimap<size_t,size_t>
to store the dependency graph.Stats::save
to save it to a new file, e.g cal_graph.csv
or something similar.Stats::updateProgramStats
to pass in the Sequence ID of the programStats::updateProgramStats
to fill the multimap using the new parameter and the second operand of the cal operationsYou can run loda maintain
to regenerate the stats. You should see log message that show the progress. It should take no more than 30 secs. You can abort the rest of the maintenance.
After we generated the new stats, we want to create a graph visualization using graphviz/dot. We want to extend make_charts.sh
to do this automatically.
(This is more a question than issue, but you could also read it as a request for a better documentation of the execution model...)
Recently, when I have hand-coded a few sequences in LODA, I have been told or noticed myself that the minimizer would like to get rid off certain statements in them.
For example, in https://github.com/ckrause/loda/blob/master/programs/oeis/003/A003958.asm#L25
it would like to delete those three statements in lines 25 - 27:
mov $7,$0 ; Check whether $0 has reached one...
cmp $7,1
sub $8,$7 ; If so, then set $8 from 1 to 0 (to stop in the next iteration!)
and in
https://github.com/ckrause/loda/blob/master/programs/oeis/008/A008683.asm#L32
it would like to get rid off those seven instruction on lines 32 - 38, after that mul $1,$7
In A087436, it would delete that mul $4,2 on line 34:
https://github.com/ckrause/loda/blob/master/programs/oeis/087/A087436.asm#L34
which however, speeds the execution (sometimes even nearly twicefold), as in that latter loop we can safely ignore even primes, and concentrate only on finding odd ones, by incrementing the "prime-candidate register" $2 by 2 (instead of 1) at every iteration.
And in A003958 and especially in A008683, the immediate exit from the loop should speed the execution a lot in right conditions (e.g., in latter when the argument is a multiple of 4 with lots of other prime factors > 2 as well).
But, the big "but" here: Have I understood the execution model of LODA machine correctly? I mean, is there any difference to the execution time with such "loop exit hacks" ? At least it is reassuring that minimizer recognizes the results to be identical even when they were removed, which kind of proves that those optimizations are correct in the sense that they don't affect the eventual result returned.
The maintenance job runs very long already. Only at the very end it updates the statistics and the readme lists. We should separate the program checks and minimization from the metadata update. Maybe just update the stats and lists at the beginning of the maintenance.
Here is an instance of somewhat fundamental fault I saw already two weeks ago, but I haven't had time to elaborate it until now.
Following is the miner-generated code in:
https://github.com/ckrause/loda/blob/master/programs/oeis/276/A276153.asm
; A276153: The most significant digit when n is written in primorial base (A049345).
; 0,1,1,1,2,2,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
mov $2,2
mov $3,3
lpb $0
mov $1,$0
div $0,$2
mov $2,$3
add $3,2
lpe
My immediate first thought when seeing the above code was that, "wow, what a concise way to do it!", and then the second thought about a second later: "but that cannot be right!".
See, instead of really generating primes as 2, 3, 5, 7, 11, 13, 17, 19, ..., the miner cuts some corners, and instead spews out "the slacker's primes": 2, 3, 5, 7, 9, 11, 13, 15, 17, ..., that is, 2 followed by all odd numbers >= 3.
Now, we can think it still computes the most significant digit of some base, and that could be called "A097801-base", for the sequence A097801 [a(n) = (2n)!/(n!2^(n-1))] begins as 2, 2, 6, 30, 210, 1890, 20790, 270270, 4054050, 68918850, ..., i.e., products 2, 23, 235, 2357, 23579, etc. (if it had a(0)=1 instead of 2, the name would be even better fit, because we still want the least significant digit for 1's).
I have submitted this as: https://oeis.org/A341356 ("The most significant digit in A097801-base.")
which is precisely the sequence the miner-code computes in its version of A276153.
The problem is that their first difference doesn't come until at n=1890.
And similarly, in
https://oeis.org/A276150 vs. https://oeis.org/A341513 (sum of digits)
https://oeis.org/A276084 vs. https://oeis.org/A341514 (number of trailing zeros)
also in these cases the first difference doesn't come until at that point.
So, I think here is a good reason to raise the upper limit up to what n the miner will check its results. Also, if the miner detects that there are several b-files in OEIS that do not diverge until at certain points (even after its own current upper limit), it would be still a good idea to check that code on all the diverging points, to deduce which ones of those OEIS-sequences it at least do NOT compute. For example, for these three sequences: A099564, A276153, A341356 the diverging points are at n=210 and n=1890, so checking the miner's program just at those points would immediately tell which two of the sequences it certainly is not.
But... there is a more fundamental issue here: even if we rise the limit, the miner's next innovation might be to compute "primes" as 2, 3, 5, 7, 11, 13, 15, 17, 19, 21, 23, 27, 29, 31, 33, ..., that is, just skip 9, 25, 49, 81, 121, ..., and other odd squares from the list of odd numbers. I would bet this is still easier to find by random mining than to find an algorithm that generates real primes. And in this case the first difference to the true primorial base wouldn't appear until at 2357111315 = 450450. And there's no way we can stop it inventing even better "slacker's approximations of primes" for ever.
Now, one solution to these kinds of problems would be to "pre-empt" such miner-attempts by hand-coding important sequences that it is likely to get wrong, and effectively protect those A-numbers from mining attempts. And that is what I did when I coded manually
https://github.com/ckrause/loda/blob/master/programs/oeis/276/A276150.asm for the primorial base sum,
based on more general algorithm presented at:
https://github.com/ckrause/loda/blob/master/programs/oeis/276/A276086.asm
(from which it is easy to create other primorial-base related sequences).
Now, for the fun, you may also take a look what these do:
https://github.com/ckrause/loda/blob/master/programs/oeis/341/A341356.asm
https://github.com/ckrause/loda/blob/master/programs/oeis/341/A341513.asm
https://github.com/ckrause/loda/blob/master/programs/oeis/341/A341514.asm
(they just call the "primorial base"-versions, so accidentally, A341356 gets the correct result, because by this logic wrong x wrong = right).
Best regards,
Antti
PS. See also the #74 for a related issue.
The OEIS superseeker is impressive. IIRC Niel Sloane has said that an AI enhanced superseeker would be even more awesome.
The sequence database is also really good at derermining the formula
And in a similar way there is the Ramanujan Machine.
I see the LODA mining code and I wonder, is this intended as an automated formula seeker?
Instead of storing the assembly code as text. Does it use a byte code representation of the instructions?
When finding a formula, is the instructions represented as a genome (similar to DNA). What kind of mutations does it do in every fitness iteration?
So many questions. Maybe a video or a podcast where you explain things.
Greetings from a fellow oeis fan.
The idea is to keep multiple generated programs for the same sequence to collect more data that could be used for machine learning. For example, the Sequence Database keeps up to 5 programs for the same sequence.
If the divisor fits into one word, we can use the "long division" algorithm, which should be faster. Benchmark: A1113.
The bin
operation currently supports negative arguments based on the paper by Kronenburg: https://arxiv.org/pdf/1105.3689.pdf
The advantage is that the miner has a higher chance of finding a working and useful program with bin
. On the other hand, the extension to negative arguments is not really trivial and sometimes the miner generates programs where negative arguments are used, but actually "normal" (positive) arguments would also work.
Should we disable negative arguments for bin
and throw an error in that case? This would mean that some programs become invalid and need to be regenerated.
Support alternative eval
modes:
Use $0
as input. Two output cells: $1
is the numerator and $2
is the denominator. So we can evaluate a program to a sequence of rationals.
(2.1) Use (1) to generate a sequence of rationals. Then sum up all terms and take the limit. Convergence not guaranteed.
(2.2) Use (1) to generate two sequences a(n)
and b(n)
. Use these two sequences in a continued fraction (cf. http://www.ramanujanmachine.com/ ).
In theory we should be able to do mining for irrational numbers, like roots, e, pi, and zeros of the Riemann zeta function: http://www.dtc.umn.edu/~odlyzko/zeta_tables/index.html
We currently check 250 terms to determine whether a program matches a sequence. If it contains values exceeding the 64-bit range, we truncate the sequence to be checked.
Recent tests show that there are around 60 programs up to A060000 that have incorrect terms. We should extend the check to more terms to filter out these incorrect programs.
The proposal is: we keep the existing logic for 250 terms and extend it by an additional test: we test also terms up to, say, N=1000, but with the following behavior: we discard the program if we found a difference, but we do NOT discard the program if an overflow occurs in this range.
Suggested new instructions: max a,b
What it would do: select the maximum of a and b and store it into a: a:=max(a,b).
Rationale: Although this can be implemented with trn and some other instructions, having it packed into a single instruction would facilitate coding many of the "take maximum of something" kind of loops that are a reasonably common pattern also in many sequences. And would also increase the chances that the miner will find such meaningful subroutines by itself.
(NB: code like https://github.com/ckrause/loda/blob/master/programs/oeis/008/A008833.asm would probably shorten considerably).
PS. As what comes to min a, b I think it might be good idea to do that (or have another opcode called something like umin) comparison a bit diffrently, as with unsigned integers, with signed value -1 reserved in that case (only) for a value like +infinity. That is, doing mov $1,-1 and then in the loop umin $1,$2 would always at the first iteration overwrite $1 with whatever value of $2 (interpreted as unsigned integer), and etc.
PPS. Forget min for a moment, or do it the standard way. In any case, max is much more used in seq. definitions.
We currently call the wget
command-line tool to fetch files from the Internet. It would be better to integrate libcurl
instead: https://curl.se/libcurl/
Blocker at the moment: we currently use make
for building. If we start consuming libraries, we'll probably need to switch to cmake
instead.
Suggested new instruction: did a,b
What it would do: Divides the target by the source value, but only if the source divides the target: a:= a/b if b|a, otherwise a:=a, making the instruction in the latter case effectively a no-op.
Rationale: Would work great with the loops, where target is also used as the loop-register. Then could be used for example to divide all instances of b out of a.
Consider the current version of A000265:
https://github.com/ckrause/loda/blob/master/programs/oeis/000/A000265.asm
; A000265: Remove all factors of 2 from n; or largest odd divisor of n; or odd part of n.
; 1,1,3,1,5,3,7,1,9,5,11,3,13,7,15,1,17,9,19,5,21,11,...
add $0,1
mov $1,$0
lpb $1,1
mov $2,$1
bin $2,2
gcd $1,$2
lpe
This is not bad, only three statements inside the loop. However, bin $2,2 which computes n*(n+1)/2, makes the effective upper-size limit (in the current 64-bit-limited implementation of LODA) to drop to about half, 31 bits or such. (Please correct me if I have understood this wrongly).
Now, with did-instruction, this would make just 5 statements in total, and with just one inside the loop:
add $0,1
lpb $0,1
did $0,2
lpe
mov $1,$0
A007814 and A007949 could be implemented similarly:
add $0,1
lpb $0,1
did $0,3
add $1,1
lpe
Similarly, for many routines that examine the prime factorization structure of n in general, it would be very useful, and obviate in the most cases the need to separately test for the divisibility with mod-instruction.
Also, we don't yet know what creative use cases the miner would invent for the did-instruction.
Bitwise-operators XOR, OR, AND would very useful for the myriad of crazy binary sequences in OEIS, including also some operations in the polynomial ring GF(2)[X], see e.g. https://oeis.org/A048720 see also https://en.wikipedia.org/wiki/CLMUL_instruction_set )
(Or if we were minimalist purists, then just NAND would suffice, hah!?)
Of course, the set of built-in primitives greatly affects to which particular directions or subsets of sequences Lodaminer will home to.
Please consider miner's code in: https://github.com/ckrause/loda/blob/master/programs/oeis/235/A235918.asm
; A235918: Largest m such that 1, 2, ..., m divide n^2.
; 1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,6,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,6,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,6,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,6,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,6,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,6,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,7,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,4,1,2,1,2,1,6,1,2,1,2,1,4,1,2,1,2
cal $0,71222 ; Smallest k such that gcd(n,k) = gcd(n+1,k+1).
mov $1,7
mov $2,$0
lpb $1
mov $1,$2
lpe
https://github.com/ckrause/loda/blob/master/programs/oeis/236/A236454.asm is otherwise same, but with add $1,1 at the end, which in itself is correct, because indeed A236454(n) = 1+A235918(n).
However, the program in both cases computes a wrong result (after a while), because it essentially computes a(n) = min(7, A071222(n)). [but note that A071222 is an offset-0 sequence, while A236454 is an offset-1 seq].
In the OEIS entry https://oeis.org/A235918 I have added the following comment:
Note that a(n) is equal to A071222(n-1) = A053669(n)-1 for the first 209 values of n. The first difference occurs at n=210, where a(210)=7, while A071222(209)=10.
Now, this diverging point 210 is still inside the range (0..250) of the current miner-settings, so one might think that the issue is not now that the checking-limit is too low. Except that it is, because the first term larger than 7 doesn't come until at n=420 in A235918:
$ gawk -f records.awk < b235918_upto10000_from_OEIS.txt
1 1
2 2
6 4
30 6
210 7
420 10
4620 12
Now, as what comes to the miner's implementation of A071222, it is interesting:
https://github.com/ckrause/loda/blob/master/programs/oeis/071/A071222.asm
as it calls A276084 (Number of trailing zeros in primorial base representation of n (A049345); largest k such that A002110(k) divides n). I think the logic is correct, even though the current implementation of A276084 is not:
https://github.com/ckrause/loda/blob/master/programs/oeis/276/A276084.asm
(it most likely computes the https://oeis.org/A341514 This is the topic of the previous issue, #72 ).
So the main point is: Could we explicitly control how the miner uses such "slacker's hacks" like min(constant, return_value_of_some_other_Axxxxx(...)), as in many cases this will produce incorrect programs?
Also, would that be any easier to do if there were an explicit min-instruction? (Note how it how does this "min" with the help of loop-construct). Or is the best hope just to raise the limit where the program is checked against? Actually, I wonder why couldn't it check the whole b-file after the first 200-300 match, if the terms in b-file stay small?
On the other hand, a construct like min(1, Axxxxxx(n)) is perfectly valid for creating characteristic functions from sequences that give a count from 0 upward for some related phenomena.
Best regards,
Antti
Idea to speed up the LODA programs that have subroutine calls if there is a stored copy of the subroutine program terms.
May require caching per term output values and the number of steps to compute.
LODA programs currently operate on signed integers (datatype number_t
). The expressiveness could be expanded by considering Gaussian Integers instead. This would mean we have a real and an imaginary part and extend the semantics of all existing operations to the domain.
An overview of relevant aspects of Gaussian Integers can be found, e.g., in the tutorial bei Keith Conrad: https://kconrad.math.uconn.edu/math5230f12/handouts/Zinotes.pdf
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.