occam-ra / occam

OCCAM Reconstructability Analysis Tools

License: Other

Makefile 1.85% C++ 76.74% HTML 4.45% CSS 0.32% Shell 0.15% C 1.29% Python 14.91% Dockerfile 0.29%

occam's People

gooseus rdjpdx gdcutting andeynunes joefusion venkatachalapathy tim-coutinho shrashan samanf94 maxrp kramer102 percyd bartmassey-upstream bowertree

occam's Issues

Install on nginx

Many folks are using nginx instead of Apache these days. The OCCAM webservice install documentation should be updated to show how to make this work.

Develop Python layer and publish package (capstone project)

The consensus seems to be that the capstone team will be focusing on building out the python layer (taking advantage of the existing python bindings to permit more RA functionality to be accessed from python instead of c++, and perhaps adding some bindings if necessary). This will enable us to provide a functional python package that will hopefully increase the accessibility of OCCAM (we want people to be able to install with pip and maybe conda). This is a big issue that might get split up once we get more active on this.

Webservice should be runnable from alternate port

For those wanting to run the OCCAM webservice on a webserver with existing applications running, OCCAM should be able to run at an arbitrary port. The default port should probably be something other than 80. The install instructions should be updated to show how to configure Apache2 to proxy to the chosen port.

Add Naive Bayes to 'Models to Consider'

"Directed, Search: Add Naive Bayes among Models to Consider.
Explanation & Algorithm: Joe has started coding Naive Bayes. Finish the debugging and implementation of this model type option. The radio button should be to the left of Chain in the Search web input page."

User misfeature (input files)

Marty would like to see some activity that helps users pinpoint problematic input files.

Communication guidelines

Generate a brief communications guideline along the lines of the contributor recommendations. I am splitting these into separate issues instead of one guidelines issue (which is too general), so I can close the GitHub workflow guidelines issue.

This guide (not an official github guide, but authored by a GitHub executive) has some good suggestions that could be incorporated.

There has been some discussion starting about whether we need to expand the slack channel structure and what that might look like. Putting together a guideline will help clarify those and other questions.

XSS in weboccam.cgi action parameter

The action parameter is untrusted user input which is directly incorporated into the generated page leading to an XSS.

For example
/weboccam.cgi?action=%3C%2Finput%3E%3Cscript%3Ealert(%22xss%22)%3B%3C%2Fscript%3E%3Cinput%3E

I have not conducted an exhaustive review of the related scripts for similar untrusted user input issues, however it appears there are likely to be more.
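One common mitigation is to HTML-escape the parameter before it is written into the page. The following is a minimal sketch using Python's standard html.escape; the function name render_action_field and the hidden-input markup are hypothetical and are not taken from weboccam.cgi.

```python
import html
from urllib.parse import unquote


def render_action_field(action):
    """Return a hidden input whose value is HTML-escaped (hypothetical markup)."""
    # quote=True also escapes " and ', which matters inside an attribute value.
    safe = html.escape(action, quote=True)
    return '<input type="hidden" name="action" value="{}">'.format(safe)


# The payload from the example URL above, decoded the way a CGI layer would.
payload = unquote(
    "%3C%2Finput%3E%3Cscript%3Ealert(%22xss%22)%3B%3C%2Fscript%3E%3Cinput%3E"
)
rendered = render_action_field(payload)
```

With escaping in place the payload is rendered as inert text rather than as markup. Every other place where request parameters reach the output would need the same treatment.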

Request for Comment: Wikipedia article(s) on RA

As discussed in our meeting, we could think about creating a Wikipedia entry on Reconstructibility Analysis. This issue could be used to discuss what goes in it. I am happy to be part of the team or take the lead on it.

Time limit on searches

Heather's proposal:

On a side note, I thought of an interesting and potentially very
useful feature for OCCAM that I could develop, which is a time limit
on searches. Given that intermediate results of beam search are valid
(albeit not necessarily great) results, it is known as an "anytime
algorithm", or one that you could demand a result from at any point
and get something well-formed. Thus, it is feasible to put time
constraints on the search computation. If this functionality were
implemented, a user could specify a query that said something like,
"Consider state based models with loops, but run for at most 20
minutes" and use it on datasets with too many variables to be feasible
for such a search. Besides taking away the uncertainty as to whether
the search will end in a reasonable amount of time at all, otherwise
intractable search parameters could be used to gain some insight, or
pick better starting points for more complete subsequent searches. A
key advantage of this approach is that it requires no user expertise
or time spent in estimating whether the search being attempted is
feasible based on the variables and their cardinalities, depth and
width selected (and less easily predicted factors like the IPF
iterations to convergence). Instead, OCCAM would just report something
like, "search terminated in 10 minutes by time limit; completed search
through depth of 3/7" or similar along with the results.
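The "anytime" idea in the proposal can be sketched as a beam search that checks a deadline while expanding each level. This is only an illustration under stated assumptions: expand and score are hypothetical callables standing in for OCCAM's model generation and scoring, and higher scores are assumed to be better.

```python
import time


def timed_beam_search(start, expand, score, width, max_depth, time_limit_s):
    """Beam search that stops at a deadline and returns the best model so far.

    Returns (best_model, completed_levels, timed_out).
    """
    deadline = time.monotonic() + time_limit_s
    beam, best, completed, timed_out = [start], start, 0, False
    while completed < max_depth and not timed_out:
        candidates = []
        for model in beam:
            # Check the clock before each expansion so a level can be abandoned.
            if time.monotonic() >= deadline:
                timed_out = True
                break
            candidates.extend(expand(model))
        if timed_out or not candidates:
            break
        # Keep only the `width` best candidates for the next level.
        beam = sorted(candidates, key=score, reverse=True)[:width]
        best = max(best, beam[0], key=score)
        completed += 1
    return best, completed, timed_out
```

The completed-levels count is what a report line such as "completed search through depth of 3/7" would be built from. A partially expanded level is discarded in this sketch; a real implementation might prefer to keep its candidates.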

Fix long notification time for background job

Marty said that it "takes too long for the browser to tell the user that a background job has been submitted." Look into that and fix.

Algorithms documentation

There is a need to document the RA algorithms (search and fit). I am starting work on this at my fork:

https://github.com/gdcutting/occam/tree/feature/documentation/docs (see any docs with 'reconstructibility-analysis-' in the title)

This will be work in progress for a while. I'm starting with the most general search algorithms and then moving eventually to disjoint, state-based, etc. (more specialized searches).

Also, I know there are a lot of documentation issues right now, but I don't think there's a problem with that. There are a number of different places where we need documentation (tutorial, FAQ, programming-related, wiki, etc.) so these all need to be open.

Generating general and specific lattice of structures

Generate general and specific lattice of structures for a small number of variables.

I think this was raised by Marty during the 1st March meeting. This might be a good thing overall when we compare similarly enumerated model spaces for other graphical models.

Update training guides/materials

Update the user PDF document and create HTML tutorials.

Add representative DV values for DV bins for calculation of a continuous expected DV

"Directed, Fit: Add representative DV values for DV bins for the calculation of a continuous expected DV.
Explanation & Algorithm: If the DV was originally quantitative and was binned, then an expected calculated continuous DV value can be calculated as <DV|IV> = sum over DV states_i{ q(DV_i|IV) * DVrep_i }, where DVrep_i is the representative DV value for the i-th DV bin. User should be able to specify these representative values by a string of values on the web input page for Fit instead of the "calculate expected DV" line. Have the new line that replaces it read "Representative values for DV bins (optional)" followed by a box in which the user can write values -- e.g., for a DV with cardinality three, the user might input something like "1.4,2.7,3.9". (Reading this in should tolerate white spaces on either side of the commas.) Do this expected value calculation in Fit if and only if the user has specified these representative DV values. It should also be possible to specify these representative values in the Occam input file with the following syntax (the first line indicates that parameters follow in the second line):"

FAQ for occam-ra

If we hope to attract data science, ML or Stat folks, we need to provide some resources for interested contributors. In the absence of well written documentation or tutorial, I suggest that we use FAQ format to answer some of the questions people might have. For example,

Why yet another ML package?
If deep learning can do everything, why do we need this package?
Isn't scikit-learn good enough?
What is so great about occam-ra and graphical models?
What is the relationship between RA and other graphical models?
There are plenty of options for me in CRAN, why pick this?
Who has used this package?
Have they used this in industry?

Update default search report settings

"Make not printing H and dLR their default settings". This should be easy (just unchecking these boxes by default).

Next steps and directions: Speeding up bottlenecks, adding features and all that

I don't know if you all have other things in mind but these are some of the things one could do once occam-ra is readied for general use. These are some of the things I have on my wish list. There are questions regarding implementing more efficient state-space searches that Marty talks about but I suspect those are things that are probably in the pipeline.

identifying computational bottlenecks and speeding up things like IPF and margin calculations (all clearly defined doable tasks) (Are there others like this?)
straightforward extensions like including latent variables and hidden Markov models
and more elaborate extensions like embedding Pearl's causal inference and mediation analysis within occam-ra or creating interfaces so that we can talk to existing R packages that do these jobs

Adopting project Code of Conduct

We have had spirited discussion on the code of conduct issue. After some discussion, we have decided to adopt a slightly modified version of the Contributor Covenant, as follows:

Final discussion and adoption is on the agenda for the Friday 2/8 developer meeting.

Graph not displaying for background fit jobs

"Running Fit now in background does not return the graph of the model,
but one can get the graph by running it in foreground. Fit should be
fixed so it can return the graph along with the numerical output when
the job is submitted as a background job."

Error on selecting an action radio button

Something is broken in action processing with the podman build.

Steps:
install podman and start
./podman/build.sh
./podman/run.sh
open http://localhost:8080 in a browser
click "Run Occam"
click any radio button
get ActionError

Github setup recommendations

After spending the last three or four weeks experimenting with my coding/git setup, I now feel like I have identified a lightweight but powerful configuration. Using Atom with plugins (for git integration and tagging/code navigation) and CL git to supplement when necessary. Need to write this up in a concise doc that gives a working recommended setup for new OCCAM developers.

Variable cardinality bug

Per Marty: "Misleading behavior (basically a bug): Occam allows a user to have a variable with cardinality 1, which leads to strange output. Turn off variables with cardinality 1 and output a notice that says 'Variables with cardinality 1 are not proper variables, so variable has been turned off.'"
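A minimal sketch of the requested check follows. It assumes, purely for illustration, that the data are available as a dict mapping each variable name to its list of observed states and that cardinality is the number of distinct observed states; OCCAM's actual data structures are not shown in this issue, and the function name is hypothetical.

```python
def disable_constant_variables(columns):
    """Drop cardinality-1 variables and collect a notice for each one.

    columns -- dict: variable name -> list of observed states (assumed layout)
    Returns (kept_columns, notices).
    """
    kept, notices = {}, []
    for name, states in columns.items():
        if len(set(states)) == 1:
            notices.append(
                "Variables with cardinality 1 are not proper variables, "
                "so variable {} has been turned off.".format(name)
            )
        else:
            kept[name] = states
    return kept, notices
```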

make file based installable and executable OCCAM

I know currently the only way to install OCCAM in Windows is via some virtual box based environment (correct me if I am wrong).

I was wondering if it would be easier to use gnu make to install and execute OCCAM on a shell. I know this is a bit prehistoric but two C++ packages that I've run on my laptop do this, stan and SNAP, the former is a Bayesian statistics package; while the latter is a social network analysis package. Both the input and output are txt files ready for further processing.

Is this something we can do with our current version of OCCAM? And won't this bypass the need to have a server?

If we do this, then one could potentially run OCCAM on Windows via MinGW or cygwin, making it easier.

Even if this is not possible currently, am I right in assuming that once we have a python package version ready in the next few months, we could -at that point- make it easier to run OCCAM on Windows?

Fix memory errors (segmentation fault) in state-based .py files.

sbfit.py and sbsearch.py give segmentation faults, at least for some (large) data files. If a fix is not realistic (might be tough to track down), at least improve error handling so the user doesn't see 'core dump'. Also check to see if this issue is present in the web version (weboccam.py).

Search options: specify number of best models

Per Marty: "In Search, there are three criteria for summary best model: dBIC, dAIC, and Incremental Alpha. Right now the summary output just prints out the single best model for each of these three criteria. What I want is to be able to specify the best n1 dBIC models, the best n2 dAIC models, and the best n3 Incremental Alpha models. User should specify these on the Search input page in the Report Settings section. Specifically where it says "Include in Report:" the check boxes for dBIC, dAIC, and Incremental Alpha should be changed to boxes that take an integer number, and the default values for these numbers should be 3 for dBIC, and 1 for dAIC and for Incremental Alpha."

Structural Design documentation and initial recommendations

The programmer-oriented documentation is out of date (15+ years old). After spending three or four weeks with the code, I am realizing that we are in need of documentation that will provide an overview of the OCCAM application structure; references for C++ objects and interfaces (the Design Proposal from 2000 is very limited); the python wrapping, and suggestions for changes focused on division between c++ and python that will make OCCAM easier to repackage and give the user a good experience by virtue of good modular design; and some other comments on overall design, practical considerations, and how to prioritize time. It is also important to introduce some considerations of design and engineering that are not specific to any particular language or platform, but should help inform the high-level discussion of how to perform a structural upgrade and repackaging on OCCAM.

This will help the capstone team in getting up to speed, and facilitate the conversation of the design and engineering issues involved and how to approach the process of making choices about priorities. I am starting to become aware of the fact that investing too much more in the existing framework is probably not a good use of time. Once we update the framework some for proper python packaging, our development efforts (in terms of bug fixes, functionality improvements, new additions) will be more productive, and hopefully we will then have access to a wider community of developers who might be interested in contributing.

I have started working on a document that gives an overview of the application structure, some details about the C++ extensions and python wrapping, overall considerations about design and engineering with a python package in mind, some specific aspects of the current OCCAM implementation that I think need most attention, and some other thoughts about priorities and making best use of time. Hopefully this will help advance the conversation in the period after Marty gets back until the end of the term (as the capstone team is starting to come on and get up to speed) and beyond. I'm going to focus on this for a couple of days and will post a link to work in progress when I'm a little further along.

Fix calculation of p-rule for DV cardinality > 2

"Directed, Fit: Fix the calculation of p-rule when the cardinality of the DV is greater than 2.
Explanation & Algorithm: Right now, p-rule is the p-value from comparing q(DV|IV), the calculated conditional probability distribution of the DV, given the IV state, with a uniform distribution. For |DV|=2 this is fine, but for |DV|>2, it actually isn't adequate. The new algorithm does this comparison with a uniform distribution as a first step, but if the p-value < p-cutoff, and it thus passes this first test, it then does a second test which compares the two highest q(DV|IV) values with (.5, .5), after first normalizing these two highest values so that they sum to 1. Call the p-value for the first test p-value1 and the p-value for the second test p-value2. If the first test fails, Occam should report p-value1. If the first test passes, then report p-value2. Indicate somehow (e.g., with some special character after the p-value) which p-value is being reported."
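The decision logic of the two-step test can be sketched as below. The statistical test itself is deliberately left out: p_value is a hypothetical callable that compares an observed distribution with a reference distribution and returns a p-value, standing in for whatever test Occam already uses. The integer returned alongside the p-value plays the role of the "special character" marker.

```python
def p_rule(q, p_cutoff, p_value):
    """Return (p, which_test) following the two-step rule described above.

    q        -- conditional distribution q(DV|IV) for one IV state
    p_cutoff -- significance cutoff for the first test
    p_value  -- hypothetical callable: p_value(observed, reference) -> float
    """
    uniform = [1.0 / len(q)] * len(q)
    p1 = p_value(q, uniform)
    # For |DV| = 2 the first test is sufficient; otherwise stop if it fails.
    if len(q) <= 2 or p1 >= p_cutoff:
        return p1, 1
    # Second test: the two highest values, renormalized to sum to 1,
    # are compared with (.5, .5).
    top_two = sorted(q, reverse=True)[:2]
    total = sum(top_two)
    normalized = [v / total for v in top_two]
    return p_value(normalized, [0.5, 0.5]), 2
```

A distribution such as (0.45, 0.45, 0.10) shows why the second step matters: it differs clearly from uniform, yet its two leading states are tied, so the prediction between them is not well supported.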

Take PSU-branded stuff out of web front pages.

Currently there is some Portland State branding in the front (HTML) files. Go through and figure out how to take this out so we're no longer enforcing the appearance from dmm.sysc.pdx.edu.

Triage older issues

We have a bunch of issues from 2019 and earlier that should probably be either bumped or closed as FIXED or WONTFIX.

Data file error in Edge

Edge gives an error that it can't read a file from the data directory (the same file works in Chrome).

Wiki - set up and add content (from user manual and elsewhere perhaps)

I am separating the wiki from the user manual. The wiki is (will be) larger and more comprehensive. The wiki will be an ongoing thing and might take a while. The wiki can link to the user manual but they are kind of separate things. The wiki can have info about RA, OCCAM history, roadmap, etc.

Converting Marty's wishlist into issues

During our meeting yesterday 02/08, Marty mentioned that he has a few dozen items on his occam-ra wish list. It might be a good idea to put them all here as issues and tag them appropriately for all of us to see.

Project contribution guidelines (recommended git workflow)

The CoC is a great step in the direction of establishing guidelines for the project, and helping new developers get oriented. There are some other similar materials that will be helpful to productivity and community relations, and which follow standard open source practice. We will eventually probably want at least:

-communication guidelines
-guidelines for contributors and maintainers
-Git(Hub) workflow guide

and some other relevant material. This will help new contributors get oriented and save time of answering the same questions over and over again.

See related issue (getting started page) at .io repo.

compile error with latest code base

Hello, I'm getting an error in one of the cpp files (Report.cpp). I was wondering if anyone knows how to fix it.

Fix white-space intolerance in variable names

Per Marty: "Occam is white-space intolerant in the long names of the variables in the
variable block. So if "New York" was a long variable name, the space
between the "w" and the "Y" would cause Occam to fail."

"We could handle this in one of two ways: either Occam could, on reading
in the input file, just ignore (delete) the space (or it could be more
than one space) and store that variable as "NewYork" (and output that in
the Occam output), or this could be treated as a fatal error, and the
user should be told to remove all such white spaces. I'd prefer the
first of these ways, but if for any reason, the second is much easier to
implement, that would be OK."
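The first (preferred) option amounts to deleting every run of whitespace from the long name when the variable block is read. A minimal sketch, with a hypothetical function name:

```python
def normalize_long_name(name):
    """Delete all whitespace from a long variable name ("New York" -> "NewYork")."""
    # str.split() with no arguments splits on any run of whitespace,
    # so multiple spaces and tabs are handled as well.
    return "".join(name.split())
```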

Convert existing user manual to markdown and post

The existing user manual is a valuable resource and is up-to-date with the current version of OCCAM. Let's convert to markdown and post to the repo wiki.

Data conversion program

We could use a python version of the existing excel data conversion program (https://www.pdx.edu/sysc/sites/www.pdx.edu.sysc/files/RA_bin_Olson%20v2.9_0.zip).

I think the place to start is to make something that can read .csv or .xlsx files, with or without a header, and convert them to the format OCCAM understands.

This wouldn't be too much work (I have quite a bit of existing code from other projects that I could use for this).

Fix basic.py in example scripts

Small changes at lines 83 and 95 will harmonize this with the function definitions and allowable attribute values in ocutils.py and make it work. I was running from occam/install/cl but I believe we need to update occam/py/basic.py, which gets installed to occam/install/cl when running make install.

Provide pre-compiled binaries

Per Joe's suggestion, we could provide pre-compiled binaries for the c++ object libraries. This will definitely make OCCAM more accessible (a lot of people aren't set up for doing compilation, so providing binaries pre-compiled makes it more likely that someone might try OCCAM out in the current form).

This is easy for linux (one of us that's installed OCCAM on linux just needs to zip the binaries and make a link). OS X will take a little more work but shouldn't be too bad. Windows is more involved (at least for me, since my Windows installation doesn't have any dev tools installed currently).

Flag models with identical search output

"If two different models have exactly the same values in Search output, Occam should flag this somehow, since this is inherently anomalous. It could mean, for example, that two different IVs actually map 1:1 to one another, so one of them should be removed. Something like this could also occur if the cardinality of one of the IVs is 1."
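One way to detect this is to group models by their full tuple of output values and report any group with more than one member. The sketch below assumes a dict mapping model names to tuples of output values; that layout and the function name are hypothetical.

```python
from collections import defaultdict


def find_identical_results(results):
    """Return lists of model names whose Search output values are exactly equal.

    results -- dict: model name -> tuple of output values (assumed layout)
    """
    groups = defaultdict(list)
    for model, values in results.items():
        groups[tuple(values)].append(model)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Exact equality is used here because the issue asks about models with "exactly the same values"; comparing rounded values would be a different, looser check.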

Output confusion matrix and ROC curve

"Directed, Fit, Test data included: Always output confusion matrix and ROC curve.
Explanation & Algorithm: Right now, if |DV|>2, the user has the option of filling in a box with the state to be used as the "negative," i.e., good, state; all other DV states will then be aggregated for purposes of confusion matrix output. If the user doesn't specify anything in this box, a confusion matrix is now not generated. Change this to the following. If the user specifies the "negative" state, use it, but if the user doesn't specify a "negative" state, automatically choose the DV state that has the highest probability in the marginal distribution, and output the state that has been chosen. Thus when test data is included, Fit should always output a confusion matrix.
Fit should also output an ROC curve. Right now, Fit outputs the prediction rule determined by whichever q(DV_i|IV) is the biggest for every IV state. This amounts to predicting the "negative" state if its calculated conditional probability is greater than 0.5; otherwise Fit predicts the other, "positive" (bad), state. An ROC curve is generated by changing this 0.5 cutoff to vary from 0 to 1 in small steps."
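Both pieces can be sketched briefly. The layout is an assumption made for illustration: the marginal distribution is a dict from DV state to probability, and each test record is reduced to a pair (q_negative, is_negative), where q_negative is the calculated conditional probability of the "negative" state and is_negative is the observed outcome. The function names are hypothetical.

```python
def default_negative_state(marginal):
    """Pick the DV state with the highest marginal probability."""
    return max(marginal, key=marginal.get)


def roc_points(cases, steps=10):
    """Return (cutoff, false_positive_rate, true_positive_rate) per cutoff.

    cases -- list of (q_negative, is_negative) pairs, one per test record
    """
    positives = sum(1 for _, is_neg in cases if not is_neg)
    negatives = len(cases) - positives
    points = []
    for k in range(steps + 1):
        cutoff = k / steps
        # Predict "negative" only if q(negative|IV) exceeds the cutoff;
        # everything else is predicted "positive".
        tp = sum(1 for q, is_neg in cases if not is_neg and q <= cutoff)
        fp = sum(1 for q, is_neg in cases if is_neg and q <= cutoff)
        fpr = fp / negatives if negatives else 0.0
        tpr = tp / positives if positives else 0.0
        points.append((cutoff, fpr, tpr))
    return points
```

At a cutoff of 0.5 this reproduces the current prediction rule for the aggregated two-state case; sweeping the cutoff from 0 to 1 traces the curve from (0, 0) to (1, 1).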

Publish a container with working image

Now that we have a way to build a working container running OCCAM, we could probably publish the container somewhere and just point people to using that as a quick setup plan? This would mean that non-developers could quickly host OCCAM most anywhere…

Publishing in Journal of Open Source Software

As discussed in our meeting today 03/15, we should aim to submit a paper about OCCAM-RA to The Journal of Open Source Software.

I am happy to be part of the team writing this paper. As Bart suggested in an email thread, at some point, we could do a Request for Comment here about this.

Get example py files (command line) working

Get example command line files working. Small errors still in sbsearch.py and sbfit.py (function name mismatches to ocutils.py). See fix of basic.py (similar issues).

Benchmark datasets and case studies

Does Marty have benchmark datasets that we can use to test improvements?
Does he have datasets and analysis that we could provide as case studies?

Most of the well known packages have datasets that are bundled and provided along with the package. These have obvious use for tutorial purposes. But, as we make improvements, it seems necessary to have datasets that we understand reasonably well.

Do we have datasets from when Joe or Heather worked on OCCAM?
Were datasets used during enhancements and improvements? If yes, then can we make a list?

If yes, then we should upload a few of them for others to use. This along with a few case studies should go a long way towards attracting more contributors and making this more attractive for others to use for academic or industry purposes.

If no such datasets are available, then some of us must work on using publicly available datasets for benchmarking and/or writing case studies.

Keeping working version (at pdx.edu) up to date with repo

We need to decide about how to keep the working version of occam (at pdx.edu) up to date with the repo. I have made a couple of small fixes/changes (default H/dLR search settings, fixes to CL files) that have not been incorporated to the working version yet. That's not a big deal for now, but at some point we'll need to make sure that we're keeping the working version updated with the repo changes. We'll need to be selective about that, at least since the repo version has the PSU branding taken out of it.

I have privileges at dmm.sysc.pdx.edu and dmit.sysc.pdx.edu, but have not been the one to make changes there before, so I don't want to start making any changes before we discuss how to do that. Again, this is not pressing yet but will become important as more changes are made to the code.

Readthedocs.org documentation - set up and configure

I decided to use https://readthedocs.org/ for documentation. It is fully free, is dedicated to hosting documentation, links to your repo, and uses webhooks to keep the docs page up to date from the repo docs. It's like the GitHub Pages thing - why would we not use this?

Docs set up and initial build complete at https://occam-ra.readthedocs.io/. Default appearance is ugly but I will fix that.