
scipy-articles's Introduction


SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more.

SciPy is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world's leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, give SciPy a try!
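For a quick feel of what that looks like in practice, here is a minimal illustration (not taken from the SciPy documentation) using two of the routines mentioned above:

import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) over [0, pi]; the exact result is 2.
value, abs_error = integrate.quad(np.sin, 0, np.pi)

# Minimize a simple scalar function; the minimum is at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)

print(value, result.x)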

For the installation instructions, see our install guide.

Call for Contributions

We appreciate and welcome contributions. Small improvements or fixes are always appreciated; issues labeled as "good first issue" may be a good starting point. Have a look at our contributing guide.

Writing code isn’t the only way to contribute to SciPy. You can also:

  • review pull requests
  • triage issues
  • develop tutorials, presentations, and other educational materials
  • maintain and improve our website
  • develop graphic design for our brand assets and promotional materials
  • help with outreach and onboard new contributors
  • write grant proposals and help with other fundraising efforts

If you’re unsure where to start or how your skills fit in, reach out! You can ask on the forum, or here on GitHub by leaving a comment on a relevant open issue.

If you are new to contributing to open source, this guide helps explain why, what, and how to get involved.

scipy-articles's People

Contributors

andyfaff, antonior92, apbard, cbrueffer, chrisb83, ev-br, fabianp, fabricios, gertingold, hameerabbasi, ilayn, insertinterestingnamehere, jarrodmillman, jni, jnothman, larsoner, martinthoma, mdhaber, person142, pvanmulbregt, rainwoodman, rc, rgommers, stefanv, stsievert, tylerjereddy, varir, warrenweckesser

scipy-articles's Issues

Comment to Address: Length of Background

Reviewer 2:

At a structural overview of the paper, the authors tried to both demonstrate the strengths of open-source and best practices for community projects while also attempting to detail what SciPy is. Both efforts are quite commendable and important but do make the paper a somewhat difficult read that may not satisfy either category of readers. In particular, I am not sure many readers will fully appreciate the Background section, which makes up ~1/3rd of the paper and seems somewhat misplaced as the beginning section. As is this structure will put off many readers.

...Overall the paper is well written, but some additional attention to presenting this towards a general audience should be exercised. In particular, the “Background” section should be greatly minimized and moved to another part of the paper (or removed entirely) so that readers can be first presented with capabilities.

The journal does not think that the background section should be removed, but that it could be streamlined.

Analysis of apparent conflict between linguist and pytest-cov/gcov data

linguist's report that SciPy 1.0 is ~50% Python / 50% Other appears to conflict with Figure 3, which suggests that Python makes up much less of SciPy 1.0.

This discussion began in #99. It was attributed to (reasonable) differences in the way the respective tools generated the data. I started to take a closer look.


I used linguist with nearly default[1] settings to calculate the number of bytes of code in each language and with the --breakdown option to generate a list of all the files it counted. To see whether linguist is excluding important files, I cross-referenced this list with the list of all files in the SciPy directory (according to os.walk). Here are the extensions of all files that linguist did not count:

{'', '.example', '.conf', '.ini', '.md', '.py', '.css_t', '.sh', '.npy', '.npz', '.1', '.sample', '.mat', '.png', '.dat', '.scipy', '.idx', '.x', '.svg', '.css', '.README', '.arff', '.src', '.rst', '.less', '.0rc2', '.bat', '.yml', '.nc', '.in', '.toml', '.info', '.2', '.patch', '.0', '.mp3', '.ogg', '.pack', '.js', '.wav', '.json', '.0rc1', '.pyf', '.txt', '.sav', '.0b1', '.html'}

One of the hypothesized reasons for the apparent conflict was that linguist tries to exclude vendored code. This could make linguist under-count the amount of compiled code, and thereby over-count the relative amount of Python code. I don't recognize any of the extensions above as corresponding to compiled code[2], so I don't think that's happening.

The other hypothesized reason for the conflict was that linguist counts the amount of each language in terms of bytes whereas pytest-cov and gcov report the number of lines. I used the file list from linguist to count the number of lines in each file to get the number of lines in each language:

{'Python': 282567, 'Fortran': 162212, 'C': 131534, 'Cython': 20082, 'C++': 14231, 'TeX': 1479, 'Objective-C': 152, 'MATLAB': 181, 'Shell': 107, 'Makefile': 50}

Python files make up 46-47% of SciPy whether you're counting bytes or lines, so that's not it either.
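For reference, the per-language line counting described above can be sketched roughly as follows (the extension-to-language mapping and the checkout path are simplifying assumptions; the real tally worked from linguist's --breakdown file list):

import os
from collections import Counter

# Simplified extension-to-language map; the real count used linguist's
# --breakdown file list rather than guessing from extensions.
EXT_TO_LANG = {'.py': 'Python', '.pyx': 'Cython', '.pxd': 'Cython',
               '.c': 'C', '.h': 'C', '.cc': 'C++', '.cpp': 'C++',
               '.f': 'Fortran', '.f90': 'Fortran'}

lines_per_language = Counter()
for root, _, files in os.walk('scipy'):   # path to a SciPy 1.0 checkout
    for name in files:
        language = EXT_TO_LANG.get(os.path.splitext(name)[1])
        if language is None:
            continue                      # skip extensions linguist did not count
        with open(os.path.join(root, name), errors='ignore') as fh:
            lines_per_language[language] += sum(1 for _ in fh)

print(dict(lines_per_language))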

I wondered whether blank lines and comments could account for the difference, so I separately counted blank lines and lines beginning with "#":

{'Python': 215938, 'Fortran': 157945, 'C': 112317, 'Cython': 14902, 'C++': 11559, 'TeX': 1345, 'Objective-C': 135, 'MATLAB': 139, 'Shell': 86, 'Makefile': 29, 'Blank': 76765, 'Comments': 21435}

That helps.
I made a crude pass at counting docstrings separately by stopping the count when there is an odd number of ''' or """ in a line:

{'Python': 141528, 'Fortran': 157945, 'C': 112317, 'Cython': 11257, 'C++': 11559, 'TeX': 1345, 'Objective-C': 135, 'MATLAB': 139, 'Shell': 86, 'Makefile': 29, 'Blank': 76765, 'Comments': 21435, 'Docstring': 78055}

That helps, too.
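The crude blank / comment / docstring classification above amounts to something like the following sketch (the odd-triple-quote toggle is the heuristic mentioned earlier; exact counts would need a proper tokenizer):

def classify_python_lines(path):
    """Crudely split a Python/Cython file into code, blank, comment, and docstring lines."""
    counts = {'code': 0, 'blank': 0, 'comment': 0, 'docstring': 0}
    in_docstring = False
    with open(path, errors='ignore') as fh:
        for line in fh:
            stripped = line.strip()
            quotes = stripped.count('"""') + stripped.count("'''")
            if in_docstring or quotes:
                counts['docstring'] += 1
                if quotes % 2 == 1:                 # an odd number of triple quotes
                    in_docstring = not in_docstring  # opens or closes a docstring
            elif not stripped:
                counts['blank'] += 1
            elif stripped.startswith('#'):
                counts['comment'] += 1
            else:
                counts['code'] += 1
    return counts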

But Figure 3 reports that the number of Python lines is only 106,878, which is still a lot less than the 141,528 I'm counting when I exclude Cython and try to remove comments, blank lines, and docstrings.
It also reports that the number of compiled code lines is 462,574, which is still a lot more than the 330,028 I count when I include everything (even blank lines, etc.) but Python.
Clearly there is still something we're missing, and I'm getting less confident about the numbers shown in Figure 3.

Thoughts?

[1] By default, linguist counts all Python and Cython code as Python. I changed one setting: count *.py files as "Java". Afterward, I changed the language labels (Python -> Cython, Java -> Python).
[2] The uncounted .py files are in the documentation, and there is a single uncounted .sh file.

Texify Benchmark Figure

This is a dummy issue to ask everyone's opinion. It is completely OK for me if we close it without merging. It is also nice for quick archiving for later. I was exceptionally busy this week so couldn't keep my promise to @mdhaber, but better late than never, I guess.


Here is the TeXified version of the benchmark figure, generated from the data in CSV format (about 100 kB in total).

[figure: rendered TikZ benchmark plot]

obtained via

\begin{tikzpicture}
\begin{axis}[
    ymode=log,
    width=14cm, height=5cm,
    date coordinates in=x,
    xticklabel={\year},
    xtick={2009-01-01 00:00,
           2010-01-01 00:00,
           20...-01-01 00:00,
           2017-01-01 00:00},
    xmin=2008-01-01 00:00,
    xmax=2018-06-01 00:00,
    ymin=0.001, ymax=1,
    log ticks with fixed point,
    grid=both,grid style={line width=0.25pt, gray!20},
    xlabel={Commit Date},
    ylabel={Execution Time (s)},
    legend entries={$m=3$, $m=8$, $m=16$}, legend cell align=left, legend pos=outer north east,
    extra x ticks={2008-11-10 17:49:08,2012-08-27 00:00,2015-05-15 17:20:07,2017-10-16 20:28:24},
    extra x tick style={grid=major,grid style={line width=1pt, gray!50, dashed},
                        ticklabel pos=top,align=center, text depth=1ex},
    extra x tick labels={{\texttt{cKdTree}\\introduced},
                         {\texttt{cKdTree} Fully\\Cythonized},
                         {\texttt{cKdTree}\\rewritten in C++},
                         {Scipy 1.0\\released}}
    ]
    \pgfplotsinvokeforeach{0,1,2} {
        \pgfplotstableread[col sep=semicolon]{static/csvout0#1.txt}\table
        \addplot+[only marks, mark=*,mark options={draw=none, mark size=1pt}]
            table[x=date,y=value] {\table};
    }

\end{axis}
\end{tikzpicture}

This needs the additional line \usepgfplotslibrary{dateplot} in the preamble, somewhere after the pgfplotstable package is loaded.

The following files need to be added under static

csvout01.txt
csvout02.txt
csvout00.txt

Citation for paper "the astropy project" is shortened in a strange way

As of now, this is citation 141. The author list has "The astropy Collaboration" as the first author name. In the citation this is treated like a given name and shortened to "Collaboration, T. A.", which does not make sense. It might be sufficient to replace the "and"s in the bibtex file for that entry with semicolons - for example, that is how "linguist developers; the open source community" is formatted. On the other hand, I suspect that that's not journal style either and it should be "linguist developers, open source community" with a comma instead of a semicolon.

Let me know if you want me to fiddle with the bibtex and turn this into a PR.

authorship for the JOSS draft paper

Everyone who contributed to scipy releases 0.18.0 or 0.18.1 is invited to be a co-author on the draft paper, gh-4.

We follow the usual ICMJE criteria for authorship:
(http://www.icmje.org/recommendations/browse/roles-and-responsibilities/defining-the-role-of-authors-and-contributors.html#two)

  • Substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work; AND
  • Drafting the work or revising it critically for important intellectual content; AND
  • Final approval of the version to be published; AND
  • Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Here contributions are defined via the commit logs, and substantial contributions are defined broadly: anything bigger than a very simple one-line documentation fix is substantial.

Anyone who satisfies these criteria and is willing to accept these responsibilities and commitments is welcome to be an author on this publication. For this, please add your name and affiliation (and, optionally, an ORCID ID if you have it), either

  • as a pull request to gh-4, or
  • as a comment to this issue.

In case of a dispute which cannot be resolved otherwise, the final decision is taken by the corresponding author in consultation with the core development team.

Possible repetition of facts about STScI in close proximity?

Right before the "SciPy Begins" section:
"Initial work on numerical computing in Python was driven by graduate students, but soon larger research labs became increasingly engaged. For example, Paul Dubois at Lawrence Livermore National Laboratory (LLNL) took over the maintenance of Numeric and funded the writing of its manual[19], and in 1998 the Space Telescope Science Institute (STScI), which was in charge of Hubble Space Telescope science operations, decided to replace their custom scripting language with Python[20].

The third paragraph of "SciPy Begins" starts:
"At this point, scientific Python started attracting more serious attention; code that started out as side projects by graduate students has grown into essential infrastructure at national laboratories and research institutes. For example, starting in 2000, STScI ported their IDL-based analysis pipeline, IRAF, to Python."

I can't tell whether the 1998 replacement of the custom scripting language and the 2000 porting of the analysis pipeline are distinct events. They are both prefaced with the idea that work started by graduate students became important at national labs, and they appear within half a page of one another, so I thought I should check.

In any case, I propose combining the two into the third paragraph of SciPy Begins. As I'm doing a "final" readthrough before release for wide review, I'm holding back my thoughts on erroneous commas etc., but I think this is important enough to bring up.

Using reST, LaTeX or Overleaf?

On the coordination committee there were 5 votes for plain LaTeX and one +0 for reST; on the mailing list, @stefanv suggested reST as used in the SciPy conference proceedings. Overleaf can be used on top of a git repo with LaTeX, but none of the people who responded particularly liked it - we can revisit that later.

Comment to Address: Future of SciPy / GPU/ distributed ecosystem

I think @mdhaber will go ahead and add the majority of the reviewer comments from Nature Methods in anonymous fashion for the community to help address, as separate issues. Here's a first set of comments to address, grouped together because they are similar requests that we've been specifically asked to address by the Editors.

Reviewer 1:

I would like a longer discussion about the future of SciPy particularly SciPy’s lack of the two features that are starting to become understood as essential to scientific computing: built-in automatic reverse-mode differentiation and heterogenous computation (e.g. CPU, DSP, GPU, etc.). Will SciPy adapt to support these increasingly vital features? Is it even possible without a substantial rewrite?

Reviewer 2:

Distributed and GPU is considered out of scope but conversely is believed to be required in the future for data and scientific analysis. A discussion of alternatives and how you interact with other ecosystems should be considered (e.g., RAPIDS, Dask, etc). How does SciPy progress with the rapidly evolving hardware?

ff84a98 pushed to main repo instead of fork

I made two mistakes:

  1. Forgot to branch before making changes/committing
  2. Pushed to the main repo instead of my fork

I know how I could have recovered from the first mistake on its own, but now that I've pushed I am not sure what is best. Perhaps these changes are fine; I've only modified @antonior92's work and my own; I've asked him to review. In any case, I wanted everyone to know that the intent was to push to my repo and submit a PR; I'll do that next time.

Coverage plot dips

The % coverage plot from codecov has a bit of strange activity (dip / spike) near the start of the 6-month time period -- it would be helpful to at least briefly explain why in the associated figure caption.

Clean References

The references section needs a fair amount of cleanup - stuff like sticking to capitalization and abbreviation conventions, adding access dates to URLs, etc.

Additional Content Ideas

Is it already decided that the current form should be the final one (besides minor improvements like typos / typography / minor image changes)?

If not, I would have two ideas:

  • Displaying project structure (e.g. as nested boxes - the size could represent the number of code lines). This is done in the "Package organization" part already, but maybe an image could be more interesting.
  • Dependency graph: showing a couple of packages that use scipy. According to GitHub, there are about 116,000 at the moment. The size of the other packages could depend on how many stars / forks / dependencies they have. The message this could convey is "this is our ecosystem and scipy is a fundamental part of it". Maybe this could even be connected with the project structure, showing how scipy enables others to do interesting stuff.

Recent technical improvements to scipy.optimize

Since there have been many recent contributors to scipy.optimize, I think a distributed approach to drafting this section would work best. Once we collect components, those who are interested can work to edit this section to make it fit with the rest of the paper. Of course, contributors would understand that their components may need to be edited and might not even make the final paper due to space constraints, but that we'd really appreciate their input in any case.

I've quickly reviewed closed PRs labeled scipy.optimize from the past few years, looking mainly at those prefixed BUG, ENH, or WIP, and picked out a few that seem appropriate for the paper. Please forgive (and point out!) any omissions or incorrect attributions.

@nmayorov #5556, #5147
@felixlen #7292
@larsmans #5158
@surhudm (or @nmayorov) #6493
@andyfaff #4191, #3446, #5605
@pv #4392 (and potentially #6769 / #5536 if you think they're significant)
@person142 #5557
and of course
@antonior92 #6919, #7165
myself, #7123

Here is a draft about the new linprog interior-point method. We might decide we want more or less about different topics - or even that we want something pretty different from what I have below - but I thought it would be helpful to give something as an example.

_A new interior-point optimizer for continuous linear programming problems, linprog with method='interior-point', was released with SciPy 1.0. Implementing the core algorithm of the commercial solver MOSEK [2], it solves all of the 90+ Netlib LP benchmark problems tested [1]. (Add something about speed benchmarks?) Unlike some interior point methods, this homogeneous self-dual formulation provides certificates of infeasibility or unboundedness as appropriate. (Add something about infeasible/unboundedness benchmarks?)

A presolve routine based on [3] solves trivial problems and otherwise performs problem simplifications, such as bound tightening and removal of fixed variables, and one of several routines for eliminating redundant equality constraints is automatically chosen to reduce the chance of numerical difficulties caused by singular matrices. Although the main solver implementation is pure Python, end-to-end sparse matrix support and heavy use of SciPy's compiled linear system solvers --- often for the same system with multiple right hand sides due to the predictor-corrector approach --- provide speed sufficient for problems with tens of thousands of variables and constraints.

Compared to the previously implemented simplex method, the new interior-point method is faster for all but the smallest problems, and is suitable for solving medium- and large-sized problems on which the existing simplex implementation fails. However, the interior point method typically returns a solution near the center of an optimal face, yet basic solutions are often preferred for sensitivity analysis and for use in mixed integer programming algorithms. This motivates the need for a crossover routine or a new implementation of the simplex method for sparse problems in a future release, either of which would require an improved sparse linear system solver with efficient support for rank-one updates._

[1] J. Dongarra and E. Grosse, "Netlib," [Online]. Available: http://www.netlib.org/. [Accessed 20 February 2018].
[2] Erling D. Andersen and Knud D. Andersen. “The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm.” High performance optimization. Springer US, 2000. 197-232.
[3] Erling D. Andersen and Knud D. Andersen. “Presolving in linear programming.” Mathematical Programming 71.2 (1995): 221-245.
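As a usage illustration (not part of the draft itself), a minimal call to the new solver might look like this:

import numpy as np
from scipy.optimize import linprog

# A toy problem: minimize c @ x subject to A_ub @ x <= b_ub and x >= 0.
c = np.array([-1.0, -2.0])
A_ub = np.array([[1.0, 1.0],
                 [1.0, -1.0]])
b_ub = np.array([4.0, 1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)],
              method='interior-point')
print(res.status, res.x, res.fun)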

Comments about this approach, the PR list, and the draft above are welcome.

Comment to Address: Scientific Software Funding

Reviewer 1:

Scientific software is depressingly underfunded. I imagine the economic cost of SciPy is in the tens (if not hundreds) of millions of dollars yet, as you mention, the actual direct financial contributions were a magnitude (or two) less. How did the SciPy community address this disparity? Are there lessons to be learned for the larger scientific community?

Figure Legends

Figure legends are limited to 350 words.
"References should be numbered sequentially, first throughout the text, then in tables, followed by figures; that is, references that only appear in tables or figures should be last in the reference list."
Maybe nothing needs to be changed, but we need to check this carefully.

Review of text on stats module

I reviewed the text on stats:

I agree with @tylerjereddy (see #65) that the text in paper.tex could be shortened. From my point of view, the following are the main points if we speak about distributions in stats:

  • more than 100 distributions available (~80 continuous, 12 discrete and 10 multivariate)
  • consistent framework that implements methods to sample rvs, evaluate the pdf / cdf, and fit parameters for every distribution (see the sketch below)
  • generally relies on specific implementations for each distribution, otherwise defaults to generic methods (such as deriving the cdf from the pdf via numerical integration)
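For example, a minimal sketch of that consistent interface, using the normal distribution:

from scipy import stats

# Every distribution exposes the same methods: rvs, pdf/pmf, cdf, fit, ...
sample = stats.norm.rvs(loc=1.0, scale=2.0, size=1000, random_state=123)
density = stats.norm.pdf(0.0, loc=1.0, scale=2.0)
tail_probability = 1 - stats.norm.cdf(3.0, loc=1.0, scale=2.0)

# Fit the location and scale parameters back from the sample.
loc_hat, scale_hat = stats.norm.fit(sample)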

Questions / ideas to shorten the current text:

  • comment on random_state: is this needed? I assume most users expect that they can provide a seed.
  • multinomial: same as random_state, probably expected that this exists. note: "only multivariate" contradicts statement that there are 12 multivariate
  • rv_histogram: do we want to sell this as an important feature? advantage / use case probably not clear from the text

When mentioning other features in subpackages.tex, the scope is quite narrow:

statistical tests, including Pearson's correlation, Spearman's rank-order correlations, Kendall's tau, chi-squared test and its generalization as the Cressie-Read power divergence, contingency table tests including Fisher's exact test and Mood's median test, and many more; and assorted transformations and statistics of data

  • we could mention more generic test classes such as testing correlation, equality of the mean / median / variance, whether a sample follows a certain distribution, etc., and give a few well-known examples. (the ones given now, such as Mood and Cressie-Read, are not very well-known, just my opinion)
  • what about KDE?
  • statistical distances?

Happy to get your view.

Consider adding Debian popcon (popularity contest) statistic

At the moment we only use the number of downloads from PyPI and conda as an indicator of popularity. But some could argue that these are very biased by all the CI deployments and do not reflect actual user base installations. An additional statistic is the Debian popularity contest, which is a voluntary, opt-in mechanism to report installed packages on a system. The guesstimate is that 1-10% of Debian users enable popcon submissions. What matters there is not the overall number of installations but the % of all Debian systems where scipy is installed. At the moment it seems to be somewhere above 4% of systems: https://qa.debian.org/popcon.php?package=python-scipy

authorship policy

What I wrote on the scipy-dev mailing list on Jan 20th:
Authorship: anyone who made a substantial contribution in the history of the project. Here "substantial" is interpreted as anything beyond a one-line doc fix. Rationale: better to be too inclusive than too exclusive. Sign-up is via a web form; we send the link to that form to all email addresses in the commit history up to v1.0.

Author order (details tbd by committee):

  1. The SciPy Developers
  2. Maintainers, paper writers, other key contributors - in order of contribution level
  3. All other authors - alphabetically ordered

Comment by Pauli:
For the authorship policy, to my knowledge there's no established standard for what to do; maybe the journal editors have an opinion, and maybe we can look at what other projects did. The contributor list for the releases is likely 600+ people, many of those indeed typo fixes, who probably don't exceed the authorship threshold. If we want to do manual pruning, we can in principle split the list into "Authors" (who did something that, when summarized in a single sentence, sounds substantial) and "Contributors" (the rest, who are thanked in the acknowledgements). However, in some fields of physics the author list tends to include everyone and grow big; e.g. the experimental Higgs paper [1] has 5000+ authors, and the author list is summarized as (XXX collaboration).

[1] https://dx.doi.org/10.1103/PhysRevLett.114.191803

For example, here are the Physical Review guidelines for authorship:

  • The paper represents original work of the listed authors.
  • All of the authors made significant contributions to the concept,
    design, execution, or interpretation of the research study.
  • All those who made significant contributions were offered the
    opportunity to be listed as authors.
  • All of the listed authors are aware of and agree to the submission of
    this manuscript.

The last constraint I think is important and appears in most guidelines. We probably don't have the current contact information for many of the contributors, and it's likely we won't receive "sign off" statements from several of them.

SciPy Begins Issue

There were a few issues in the SciPy Begins section not fixed by #105:

  • There are a lot of uncited facts in this section. Is that OK?
  • There are a ton of names mentioned in this section. Do we want that?
  • The paragraph beginning "As SciPy, the algorithms library of the ecosystem..." gives context to SciPy, but isn't really about the SciPy library. Does it deserve so much space?

Complete Author List

After authorship decisions are complete, we need to add the authors to the manuscript.

According to #10 the order is:

  1. The SciPy Developers
  2. Maintainers, paper writers, other key contributors - in order of contribution level
  3. All other authors - alphabetically ordered

I think that means that 1 and 2 will be listed on the front page and 3 is the definition of the SciPy Developers consortium:
"In a separate section at the end of the manuscript (after the ‘References’ section) under the heading ‘Consortium’, the names of each consortium member should be listed."
https://www.nature.com/srep/journal-policies/editorial-policies#author-responsibilities

Figures should be vector format

I have most of them as vectors; just need to replace the .pngs.

"For optimal results, all line art, graphs, charts and schematics should be supplied in vector format, such as EPS or AI, and should be saved or exported as such directly from the application in which they were made. Please ensure that data points and axis labels are clearly legible."

Cover Letter

We need a cover letter.

@tylerjereddy You've mentioned that you know our editorial contact, Syma Khalid. Did you plan on writing the cover letter or should I draft it? I was hoping we could submit by Wednesday.

Scope and structure of SciPy 1.0 paper

EDIT: tentative structure and authors:

Original description

We need to decide on the topics/structure/scope of this paper.

My suggested list was:

  • Technical topics (relatively new and noteworthy features):
    • Sparse matrices (present & future). Or data structures (include cKDTree, lots of improvements went into that)?
    • LowLevelCallable
    • cython_blas/lapack/special
    • optimize
    • stats distributions
    • benchmark suite
    • Key issues: ndimage pixel vs point, sparse arrays, splines, fftpack vs. np.fft, under-maintained submodules, ...?
  • Project/package topics:
    • History (see 1.0 release notes)
    • Language preferences (Python, Cython, C, C++, Fortran) and why. Numba, Pythran? Idea: measure familiarity with these for all core devs
    • Roadmap, governance, CoC, scipy.org and SciPy ecosystem (also, packages depending on SciPy?)
    • Making development easier (CI, runtests, Bento, Docker, etc.)
    • Maintainers, growth of team over time, commit log stats
    • Versioning, evolution of API & ABI

Comment by @pv:
I think the harder question then is the story the paper wants to tell. The challenge here, I think, is that "Scipy 1.0" consists of bits and pieces, and unlike more field-specific software projects the common theme may be more difficult to state. There's also the previous "Scipy paper", and presumably we want to write a "followup" or "update"?

One possible story could then be (which also seems to be what was already suggested) to write about what went on in the recent years with the project. In this case the focus would be on the new stuff, and we'd just state what existed before that in the introduction or a short background section, and then the body of the paper would proceed to the particular things (e.g. as in the list of topics Ralf gave) that we want to talk about.

Multiple affiliations OK?

I noticed that some authors have multiple affiliations. If that is OK, I'd like to add
"Space Dynamics Laboratory, 1695 North Research Park Way, North Logan, UT 84341, USA" to my affiliations.

For the rest, I think the spatial package could use a clearer summary in the project scope section. The whole sentence looks a bit off, actually

and while sparse.csgraph or spatial offers basic tools for working with graphs and networks compared to a more specialized Python library like NetworkX.

Might work better as

and sparse.csgraph and spatial offer basic tools for working with graphs and networks compared to specialized libraries like NetworkX.

EDIT: As an historical note, I think SciPy 2008 was the first where we actively sought presentations from outside the core developer community; there was a mailing list thread on the topic that I was involved in.

Suggestions for further improvement

First of all, I like this paper a lot. It is really very interesting and very well written. Thanks to everybody who has contributed to it.

Apart from a few smaller fixes collected in #208, I have a few suggestions for (small) changes.

  • On page 3, "to solve and minimize nonlinear equations" does not make sense to me because equations cannot be minimized. An option could be "to solve nonlinear equations and minimize nonlinear functions" even though I admit that the text becomes more clumsy.

  • On page 4, it may not be clear to every reader what event is referred to in sentence 3 of the third paragraph in the section "SciPy matures". After all, the previous paragraph talks about documentation. How about replacing "The event" by "The SciPy conference"?

  • On page 7, the last couple of lines in the first paragraph of the section "Project scope" seem broken. I do not think that a construction "while ... compared to ..." is possible. A simple solution could be to drop "while".

  • In table 1, the only line in the table not referenced in the legend (apart from the references which are self-explained) is "Line search (LS) or trust-region (TR)". This line might deserve being mentioned in the legend.

Comment to Address: Technical Details

Page 2. A clear distinction between CPython and Python is stressed, which might not be relevant to many audiences, as alternatives like PyPy are likely not well understood. This is also an example of where the paper may delve too much into technical details.

set up paper build

We need:

  • a Makefile to build the paper
  • TravisCI or CircleCI to rebuild the paper on merges into or pushes to master.
  • Uploading the built pdf to a fixed location, and link that from the README.

References in Table 1

The references in the last line of Table 1 could be more nicely formatted using the \citen command from the cite package (already loaded by the wlscirep class currently used).

Direct use makes the table wider, so scaling could be appropriate:

\newcommand{\inlinecite}[1]{\footnotesize\citen{#1}}

in the header, and

\inlinecite{nelder_simplex_1965, wright_direct_1996}

in place of the \cite commands in

References & \cite{nelder_simplex_1965, wright_direct_1996} & \cite{powell_efficient_1964} &

Revisions Tracking Issue

Based on a full read-through of the manuscript on January 11, 2019, my detailed revision notes / check-list are below & of course open to adjustment, etc. Some of the objectives of the revisions at this stage include unifying the paper to a "coherent voice" / style, looking at concision / focus due to its length, and considering the submission requirements of the journal Scientific Reports as well.

  • format our manuscript for submission based on the Journal format requirements:
  • "Scientific Reports publishes original research in one format, Article. In most cases we do not impose strict limits on word counts and page numbers, but we encourage authors to write concisely and suggest authors adhere to the guidelines below."

    • Articles should be no more than 11 typeset pages in length

    • the main text (not including Abstract, Methods, References and figure legends) should be no more than 4,500 words.

    • The manuscript text file should include the following parts, in order:

      • a title page with author affiliations and contact information (the corresponding author should be identified with an asterisk).

      • The main text of an Article can be organised in different ways and according to the authors' preferences, it may be appropriate to combine sections.

      • Figure Legends (these are limited to 350 words per figure)

      • Tables (maximum size of one page)

    • Footnotes are not used.
    • authors may choose to incorporate the manuscript text and figures into a single file up to 3 MB in size in either a Microsoft Word, LaTeX, or PDF format - the figures may be inserted within the text at the appropriate positions, or grouped at the end.

Notes / check-list as I read through the paper:

  • Introduction [my general thought on the intro section is that it is rather anecdotal & would benefit from at least bolstering with some more citations -- i.e., cite conference proceedings instead of just mentioning social events / community & so on; also may consider effectively fusing with Background, which is also anecdotal]
    • add citation / link for the published SciPy-importing LIGO scripts mentioned
    • in the first paragraph we should immediately add some kind of citation for NumPy & explain briefly what it is for the audience of the journal
    • citation for "...backed by millions in funding, and thousands of highly qualified engineers..."
    • citation for "...entire ecosystem of related packages..." &
    • "...a variety of social activities centered around them" (cite conferences?? it is done a bit later, but either do it first here or remove)
  • Background [anecdotal, but mostly with appropriate citations -- some subsections are noted to need some development and / or cleanup + one or two are abrupt changes in direction from the prose that precedes them]
    • first paragraph -- one thought here: this is a scientific journal, so we could mention that Python is a dynamically typed language & also cite the most popular CPython implementation with the highly-cited Python - C API reference manual produced by Guido
    • "A number of these packages were written by graduate students and postdoctoral researchers to solve the very practical research problems that they faced on a daily basis."
      • try to bolster this anecdote with example project / package citation?
    • "Wingware’s C++ Visualization Toolkit" --> should be Kitware now, I think, and cite?
    • cite Project Jupyter at first mention?
    • "NumPy 1.0 was released in October 2006"
      • this text renders in red but has no period or citation -- misplaced?
        • perhaps this needs a one or two-sentence expansion -- NumPy became the new standard for the array & is now ubiquitous
    • we only now mention that Python is "interpreted," from Paul Dubois quote -- these language specification details (interpreted, dynamically typed) for Python should be mentioned early on, I think
    • "In addition to these longer introductory articles,"
      • repeat on "In addition" from last sentence
    • the tense of the prose in the first paragraph of the "SciPy Matures" subsection feels a little off
    • "special issues are organized and published in a leading scientific magazine,"
      • cite it?
    • prose tense issues continue a bit in subsequent parts
    • "While the US SciPy conference kept growing (by ... it had multiple tracks"
      • this needs to be filled in or removed
    • "When Paul edited the CiSE..."
      • let's use surnames consistently in the paper I think -- there are a lot of authors / contributors, even if we might assume from the context...
      • what is CiSE?
      • this sentence needs a cleanup in structure too
    • "...NumFOCUS organizes a global network of community driven educational program called PyData"
      • maybe conferences / meetups instead of or in addition to program?
      • certainly program should be plural at least?
    • the "Project Scope" subsection is a bit of a sudden switch away from the anecdotal material that preceeds it for much of the background
    • "...SciPy provides what one expect to find..."
      • cite the other projects mentioned in this sentence, at least if they haven't already been cited before in the paper
      • plural expects
      • the sentence is then abruptly followed with an unlabeled list, and this needs to be cleaned up
    • Current status (maturity, users) subsection
      • this subsection looks messy / undeveloped with question marks and ellipses & then just mentioning the timeline figure which isn't even cited in LaTeX
  • Architecture and implementation choices
    • Submodule organization
      • for constants -- should we mention our policy on frequency of updates? I saw an issue about this recently with some metric things being refined I think; maybe a minor point though
      • would it be neat / useful to add the years the subpackages were first present in SciPy?
      • Is this subsection more appropriately presented as a Table proper to condense the content? The extended list of definitions in journal paragraph format may be awkward -- conversely, the Table might get pretty tightly packed
      • citations for wrapped packages like FITPACK & ODRPACK mentioned in this subsection?
        • we might as well, they're already cited in the language choices section below
    • Common Infrastructure
      • This subsection is blank -- do we want to keep it?
    • Language choices
      • should maybe cite linguist library repo
    • API and ABI evolution
  • Key technical improvements (last 3 years)
    • Data Structures
      • cKDTree [this section needs major revision / clarification / TODO items completed / some material moved to SI?]
        • can we really get away with inline discussions that include the assumption that the reader will understand something like "np.arange(max(k))"? because we have that at the moment here
        • the second and third paragraphs are extremely technical -- we may have to simplify this a little bit; it is hard for me to read & I wrote some of it too -- the other problem with presenting this clearly is that the original authors of the code have limited time to write clear prose about what they did and what algorithms they used
        • sentence starting "The cKDTree module implements a dual tree counting algorithm..." is a bit sudden / list-like
        • There's a TODO-style sentence in this section with "Add a Figure to show the scaling, before and after. perhaps give an example or some formula. (cite / mention faster implementions of paircounting algorithms / treecorr, corrfunc)"
          • again, as noted above, clear & concise presentation of this information would benefit from consultation of original authors and / or close & careful study of the source code
      • sparse matrices
        • provide citations for those performance improvements (Pauli's asv website or??)
        • the section is otherwise short & sweet, which is good
    • Unified bindings to compiled code
      • LowLevelCallable
        • probably okay, albeit fairly obscure discussion outside Scientific Python community
    • Cython bindings for BLAS, LAPACK, and special
      • this is a subsection of "technical improvements" -- if this was one of the things added in last 3 years we should clarify that somewhere in the discussion -- i.e., switch language from "Scipy includes... " to "Scipy now includes..."
      • otherwise, the section is quite readable, I think
    • Numerical optimization
      • I suspect we should remove the background on what scipy.optimize does with the list of 7 problems -- we should find a way to make this relatively clear & abstract that information to earlier in the paper when the 16 SciPy submodules are described
      • Just focus on improvements -- can even remove sentence "Documentation of SciPy’s functionality in each these areas can be found in (cite SciPy documentation)..."
        • Linear optimization
          • well-written: indicates new algorithm implemented, what it does, and what can be done to improve in future
        • Nonlinear optimization
          • Local
            • Table 1 is nice; the sentence citing it should mention that it includes some of the more recent additions since we're still in the umbrella of key / recent technical improvements
            • that said, Table 1 also includes additions from as far back as SciPy 0.6 -- which is a longer time span than "last 3 years" for the improvements section as a whole...
              • what about i.e., a dashed vertical line / highlight of some sort dividing "last 3 years" and "before that" in the Table?
            • The Table caption is very detailed, but I'm a little concerned about its size -- could we cut it down a bit and move some of the heavy details to i.e., supporting information? [entire Table now in SI]
            • I think we do have to move some of the context in the paragraphs here away so that only "last 3 years" / new stuff is emphasized and the older things are either abstracted to the supporting info or moved to the table summarizing what the 16 SciPy submodules do
              • for example, Nelder-Mead, Powell and COBYLA are discussed but from version 0.6 or earlier, so maybe best relocated
              • there's a paragraph starting "One recently-added trust-region method is trust-exact..." -- this is the kind of content we want here
          • Global
            • it is not clear to me which parts of the discussion here fall into the relevant "last 3 years" improvements category, and which are better placed in SI or a previous discussion of what SciPy can do more broadly
    • Statistical distributions
      • this can be condensed (maybe?) to only include the recent improvements -- there's too much material about what scipy.stats does more broadly, I think; that material should perhaps either be moved to SI / the general Table, or somewhere else?
    • Polynomial interpolators
      • "UnivariateSpline and splrep/splev combo" -> combination
      • there's also a fair bit of background for this subsection before we get to the new stuff, but perhaps it is sufficiently convoluted that at least some background is justified
      • I suspect we could make some effort to clarify old vs. new here & to migrate some of the material that is more "user guide / reference" like elsewhere
    • Test and benchmark suite
      • I suppose it isn't fully clear which testing / performance things are from the last 3 years vs. much older
      • Benchmark suite
        • cite asv library repo maybe?
        • the "see above" comment re: unit tests may now be below?
        • shouldn't the "python run.py" command just be "asv run"? maybe it was supposed to be runtests.py or the custom SciPy handler, but using asv directly is probably preferred anyway
        • Fig 4.: I've previously had concerns that it is a little unprofessional to use a screen capture for a Figure in a scientific journal, but I think someone commented that it is ok to show the "true output"
        • "the documentation" should probably be "the asv documentation" in last paragraph
      • Test suite
        • Figure 5: bizarre dips and lack of x-axis label / our inability to modify the figure easily for reviewers seem problematic
          • we could try to regenerate this programmatically using coverage code over the commit history of the project, but that may be a bit of an undertaking to get right
  • Project organization and community
    • Governance
      • seems ok
    • Roadmap
      • this still reads like an awkward list -- should it be a Table or fully fleshed out prose?
    • Community beyond the SciPy library
      • this is ok; another example might not hurt
    • Maintainers and contributors
      • citation / link for SciPy Developer Guide maybe
  • Discussion
    • Still contains TODO-style sentences like "The Discussion should be succinct and must not contain subheadings."
    • Impact now
      • Good enough I think
      • [x] the > 13 M downloads Figure may be good to mention in the Abstract as well for "wow" factor I suppose
    • Future development
      • Condense the PyData sparse content a little maybe & update if appropriate given any recent roadmap / plan adjustments?
      • there is nothing else in this section -- perhaps just refer back to the roadmap then, or combine roadmap and future directions somehow??
  • Acknowledgments
    • TODO
  • Author contributions statement
    • TODO
  • Additional information
    • TODO
    • Competing Interests Statement -- contributors with company affiliations that might be perceived as a conflict of interest?
  • Supporting Information
    • TODO / maybe?

ASV Benchmarking Figure

There have been concerns about the quality of the ASV Benchmarking Figure (#65, #117).

[figure: asv_time_query_ckdtree]

Apparently, the data are no longer available and are difficult to regenerate, so I scraped the data from here for the spatial.CNeighbors.time_count_neighbors_deep benchmark and produced:

[figure: regenerated asv benchmark plot]

If you have style tweak suggestions that would be easy for me to implement (in Inkscape), I will consider them. I would welcome suggestions (as RGBA hex codes) for the color palette. I would also welcome a draft caption, as I don't know what this benchmark really does.

I will submit a PR to replace the figure after I find out what the improvements were. If you happen to have ideas about those, please let me know. In case it helps, the dates and commit hashes corresponding with each data point are here.

I have no attachment to this benchmark and I'd prefer to have a denser set of points, so if the data for the original benchmark is found or recreated I'd be happy to swap it in.

Timeline Graphic?

I think Figure 2 should be represented graphically or eliminated.

[figure: scipy_timeline]

Getting permission to print logos might be a headache. I suppose we could just lay out the text on a timeline? Anyway, I'm suggesting something other than the current text list.

Mix of mathematical objects and operations ?

The second sentence of the introduction mentions "algorithms for optimization, integration, eigenproblem" (mathematical operations/processes) and "algebraic and differential equations".

eigenvalue problems, algebraic equations, differential equations, and many other

The latter are not operations but objects upon which many different algorithms can operate (zero and fixed-points search, time or time-space integration, inverse problem to determine some optimal initial condition with respect to some criterion, etc...).

Even if the sentence is grammatically correct, shouldn't this be reworded?

Comment to Address: SciPy Announcement Release Information

Page 3. Figure 1. There should be a date on when the announcement was initially released and potentially medium used.

This refers to the "Excerpt from SciPy 0.1 release announcement":

SciPy is an open source package that builds on the strengths of Python and
Numeric providing a wide range of fast scientific and numeric functionality.
SciPy's current module set includes the following:

    Special Functions (Bessel, Hankel, Airy, etc.) % hanker -> Hankel
    Signal/Image Processing
    2D Plotting capabilities
    Integration
    ODE solvers
    Optimization (simplex, BFGS, Newton-CG, etc.) % Netwon -> Newton
    Genetic Algorithms
    Numeric -> C++ expression compiler
    Parallel programming tools
    Splines and Interpolation
    And other stuff.

repo access levels

We might want to set a policy for read/write permissions to this repository. The simplest option seems to be to have the same access level for the main scipy/scipy code repo and this one.
