karlicoss / orgparse

Python module for reading Emacs org-mode files

Home Page: https://orgparse.readthedocs.org

License: BSD 2-Clause "Simplified" License

Python 99.13% Makefile 0.20% Shell 0.67%

org-mode org python

orgparse's Introduction

orgparse - Python module for reading Emacs org-mode files

Install

pip install orgparse

Usage

There are pretty extensive doctests if you're interested in some specific method. Otherwise here are some example snippets:

Load org node

from orgparse import load, loads

load('PATH/TO/FILE.org')
load(file_like_object)

loads('''
* This is org-mode contents
  You can load org object from string.
** Second header
''')

Traverse org tree

>>> root = loads('''
... * Heading 1
... ** Heading 2
... *** Heading 3
... ''')
>>> for node in root[1:]:  # [1:] for skipping root itself
...     print(node)
* Heading 1
** Heading 2
*** Heading 3
>>> h1 = root.children[0]
>>> h2 = h1.children[0]
>>> h3 = h2.children[0]
>>> print(h1)
* Heading 1
>>> print(h2)
** Heading 2
>>> print(h3)
*** Heading 3
>>> print(h2.get_parent())
* Heading 1
>>> print(h3.get_parent(max_level=1))
* Heading 1

Accessing node attributes

>>> root = loads('''
... * DONE Heading          :TAG:
...   CLOSED: [2012-02-26 Sun 21:15] SCHEDULED: <2012-02-26 Sun>
...   CLOCK: [2012-02-26 Sun 21:10]--[2012-02-26 Sun 21:15] =>  0:05
...   :PROPERTIES:
...   :Effort:   1:00
...   :OtherProperty:   some text
...   :END:
...   Body texts...
... ''')
>>> node = root.children[0]
>>> node.heading
'Heading'
>>> node.scheduled
OrgDateScheduled((2012, 2, 26))
>>> node.closed
OrgDateClosed((2012, 2, 26, 21, 15, 0))
>>> node.clock
[OrgDateClock((2012, 2, 26, 21, 10, 0), (2012, 2, 26, 21, 15, 0))]
>>> bool(node.deadline)   # it is not specified
False
>>> node.tags == set(['TAG'])
True
>>> node.get_property('Effort')
60
>>> node.get_property('UndefinedProperty')  # returns None
>>> node.get_property('OtherProperty')
'some text'
>>> node.body
'  Body texts...'

orgparse's Issues

Possible mistake in heading regexp

orgparse/orgparse/node.py

Line 40 in 702faa6

RE_HEADING_STARS = re.compile(r'^(\*+)\s*(.*?)\s*$')

Hi,

There may be a mistake in this regexp. An Org heading requires one or more stars at the beginning of the line, followed by one or more spaces. But this regexp appears to make the spaces after the stars optional. This is important, because a star at the beginning of a line, without spaces afterward, may be the beginning of bold text.
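To make the difference concrete, here is a minimal sketch that compares the pattern quoted above with a stricter variant. The name RE_HEADING_STRICT and the `\s+` change are assumptions made for illustration, not a patch taken from the project.

```python
import re

# Pattern quoted in the report: whitespace after the stars is optional.
RE_HEADING_LOOSE = re.compile(r'^(\*+)\s*(.*?)\s*$')

# Hypothetical stricter variant: at least one whitespace character
# must follow the stars.
RE_HEADING_STRICT = re.compile(r'^(\*+)\s+(.*?)\s*$')


def is_heading(pattern, line):
    """Return True if `line` is treated as a heading by `pattern`."""
    return pattern.match(line) is not None


bold_line = '*bold* text at the start of a line'
heading_line = '** Real heading'

# The loose pattern accepts the bold text as a heading; the strict one does not.
loose_accepts_bold = is_heading(RE_HEADING_LOOSE, bold_line)
strict_accepts_bold = is_heading(RE_HEADING_STRICT, bold_line)
strict_groups = RE_HEADING_STRICT.match(heading_line).groups()
```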

orgparse.load() is broken for file-like objects

Hi,

The docstring says that the argument can be "str or file-like" and I've been using it with the latter. But that got broken by a recent venv update.

Looks like commit 3067189 is the culprit. It added a line

path = str(path) # in case of pathlib.Path

which means that it's now trying to open a file named <_io.TextIOWrapper name='tasks.org' mode='r' encoding='UTF-8'>, which obviously fails.

That line of code looks a bit out of place, or at least unrelated to the commit message. Was it intended to be committed?

-Ben
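One way to accept both kinds of argument is to convert only real paths and leave file-like objects alone. The helper below is a hypothetical sketch (read_org_lines is not part of orgparse) that shows the dispatch.

```python
import io
import os


def read_org_lines(path_or_file):
    """Hypothetical helper: return the lines of an org document.

    Accepts a str, a pathlib.Path (any os.PathLike), or an object
    with a read() method.
    """
    if isinstance(path_or_file, (str, os.PathLike)):
        # Only real paths are opened; open() accepts Path objects directly.
        with open(path_or_file, encoding='utf8') as orgfile:
            return orgfile.read().splitlines()
    # Anything else is assumed to be file-like and is read as-is.
    return path_or_file.read().splitlines()


lines = read_org_lines(io.StringIO('* Heading 1\n** Heading 2\n'))
```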

Question: Why is `_special_comments['TITLE']` a list?

Sorry for asking as an Issue but I couldn't find another "channel" for that.

I am new to orgparse and org syntax. I would like to know why the TITLE entry in _special_comments is of type list. In my case there is only one entry, never more. In what cases would there be more than one element in that list? Maybe you can give an example org file to illustrate that.

>>> orgobj._special_comments['TITLE']
['My Title']

Move `tests` outside the package folder

Your current folder structure looks like this

orgparse
├── doc
│   ├── ...
│   └── source
│       └── ...
├── ...
├── orgparse
│   ├── date.py
│   ├── extra.py
│   ├── __init__.py
│   ├── inline.py
│   ├── node.py
│   ├── py.typed
│   ├── tests
│   │   ├── data
│   │   │   └── ...
│   │   ├── __init__.py
│   │   ├── test_data.py
│   │   ├── test_date.py
│   │   ├── test_hugedata.py
│   │   ├── test_misc.py
│   │   └── test_rich.py
│   └── utils
│       ├── __init__.py
│       ├── _py3compat.py
│       └── py3compat.py
├── pytest.ini
├── README.rst
├── setup.py
└── tox.ini

The problem in short

Your current folder structure looks like this

orgparse
├── orgparse
│   ├── __init__.py
│   ├── ...
│   ├── tests
│   │   └── *.py

Because of that, your package always ships (via PyPI or distros) with its unit tests included. There is no need for this: the package is bloated, resources (data transfer) are wasted, and CO2 is produced.

Besides the wasted resources, this layout is unusual today. There are also some other packaging- and unittest-related problems.

The recommended structure looks like this.

orgparse
├── orgparse
│   ├── __init__.py
│   ├── ...
├── tests
│   └── *.py

Today there are several variants of project folder layouts (e.g. the so-called src-layout), but all of them have in common that the package folder (in your case orgparse/orgparse) is always separated from the tests folder (orgparse/tests).

Because you are using tox, I am not able to provide you with a PR. I am sure tox can handle that, but I assume that tox needs to be reconfigured after modifying the folder structure.

There are also some problems with your unittest invocation which I report in a separate Issue (#56).

Document status of date parsing

Is orgparse.date public API? I'd like to use it in my code, but I'm not sure if it's intended as public API, and I wouldn't want to depend on it as library code if it's not guaranteed to stay around as a stable API (by "stable" I don't mean "not growing" of course: just not mutating existing interfaces, except for carefully considered things like bugs, undefined behaviour...).

If it isn't public & stable, I'll just copy it and it will still be useful :-) -- just not as much as maintained library code of course.

Offer `CONTRIBUTE.md` (was: Unittest not running)

While writing a long Issue text about why unittest isn't running I found out that you use pytest.

Please offer information like this in your README or in a separate CONTRIBUTE.md. How to create a PR, and against which branch? Naming conventions for new branches? Code guidelines? etc.

Date parsing with timezone information

My CLOCK are generated this way:

  CLOCK: [2020-09-09 Wed 09:04 CEST]--[2020-09-09 Wed 09:04 CEST] =>  0:00

Context

Notice the timezone information. I need to be able to parse this timezone information to have non-naive datetimes.
Indeed, I'm travelling as I work, and use these clock information generated in local times all over the world.

They need to be converted then to my customers timezone.

About non-naive datetime

It seems quite desirable to handle only non-naive datetimes anyway.
Note also that I don't think I've done anything special to change the time format of emacs.

Analysis of code

First, orgparse doesn't recognize this because the timezone information is not expected.
I'm able to fix that, but then what should be done with the timezone information?
Python's datetime is a really big mess and does not know how to solve this alone:

datetime.datetime.strptime
- still creates naive datetime using "%Z"... this is a bug/quirk (see: stackoverflow nice summary)
- could create non-naive datetime using "%z" for some python 3.2+ only.
- anyway "%Z" and "%z" do not know how to handle more than a very few timezone specifier
  - see patch to update doc about %Z not handling much
  - see python issue about %Z not handling much

Otherwise you must create a tzinfo object yourself...

So, the only ways to do this correctly are:

- Either keep the full "%Y-%m-%d %H:%M %Z" string unparsed (at least it is complete info) and give it to OrgDateClock so that we can do some proper parsing of this info (but then, how do you check the duration? I also didn't check whether other code expects OrgDateClock to hold actual datetime objects).
- Or depend on some other external Python package to create a full-fledged datetime object.

How do you view this problem, and are you interested in supporting full non-naive time parsing?
I'm able to send you a PR, but won't do it without your approval, as it may imply some decisions about adding a dependency to orgparse.
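A third possibility, sketched below under stated assumptions, is to keep the standard library only and resolve the abbreviation through an explicit, user-supplied table of fixed offsets. TZ_ABBREVIATIONS, RE_TIMESTAMP and parse_org_timestamp are hypothetical names, and the table is deliberately tiny: abbreviations are ambiguous in general, so a real table would have to be chosen by the user.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the caller decides what each abbreviation means.
TZ_ABBREVIATIONS = {
    'UTC': timezone.utc,
    'CET': timezone(timedelta(hours=1), 'CET'),
    'CEST': timezone(timedelta(hours=2), 'CEST'),
}

# Date, day name, time, and an optional upper-case zone abbreviation.
RE_TIMESTAMP = re.compile(
    r'(\d{4}-\d{2}-\d{2}) \w+ (\d{2}:\d{2})(?: ([A-Z]+))?')


def parse_org_timestamp(text):
    """Parse '2020-09-09 Wed 09:04 CEST' into a datetime.

    Returns an aware datetime when an abbreviation is present and a
    naive one otherwise. Unknown abbreviations raise KeyError.
    """
    match = RE_TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError('not a recognised timestamp: %r' % text)
    date_part, time_part, tz_name = match.groups()
    naive = datetime.strptime(date_part + ' ' + time_part, '%Y-%m-%d %H:%M')
    if tz_name is None:
        return naive
    return naive.replace(tzinfo=TZ_ABBREVIATIONS[tz_name])


start = parse_org_timestamp('2020-09-09 Wed 09:04 CEST')
start_utc = start.astimezone(timezone.utc)
```

Fixed offsets avoid a new dependency, at the cost of not knowing daylight-saving rules; the abbreviation itself has to carry that information, as CET and CEST do here.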

speedup parsing

Not that it's slow, but making it even faster wouldn't hurt. Or at least setting up some proper benchmarks.

https://github.com/org-roam/test-org-files is a good source of test files

py-spy output from parsing a bunch of files:

Note that iterative parsing (using generators) makes it a bit misleading

_iparse_timestamps appears as a child call of _iparse_repeated_tasks

I tried replacing re with regex (https://pypi.org/project/regex), but it didn't have any effect.

Tags containing letters outside of a-zA-Z

In file node.py the constant RE_HEADING_TAGS should probably be something like '(.*?)\s*:([\w@:]+):\s*$' to allow for characters like "ä", "č" and so on in tags that are accepted by org-mode without problems.
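The effect of the proposed pattern can be checked with the standard re module, where `\w` matches Unicode letters for str patterns. The ASCII-only pattern below is an assumed stand-in for a restrictive character class, not a quotation of the current orgparse source.

```python
import re

# Pattern proposed in the report.
RE_TAGS_UNICODE = re.compile(r'(.*?)\s*:([\w@:]+):\s*$')

# Assumed ASCII-only counterpart, used here only for comparison.
RE_TAGS_ASCII = re.compile(r'(.*?)\s*:([a-zA-Z0-9_@:]+):\s*$')

heading = 'Buy tea :čaj:ärger:'

unicode_match = RE_TAGS_UNICODE.match(heading)
ascii_match = RE_TAGS_ASCII.match(heading)

# Split the captured tag string into individual tags.
tags = unicode_match.group(2).split(':')
```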

Otherwise very helpful little library. Thank you for sharing it!

How to access comments and properties in clocks (Logbook entry)

Hi,

I'm using orgparse to get a list of recent clock entries. However I don't see how to access the properties and comments inside the clock entries.

Here's an example:

* AHU-Tickets
** TODO [#C] a sample ticket with priority, in my AHU project           :AHU_39:
:PROPERTIES:
:assignee: Matthew Carter
:filename: this-years-work
:reporter: Matthew Carter
:type:     Story
:priority: Medium
:status:   To Do
:created:  2019-01-24T23:24:54.321-0500
:updated:  2021-07-19T18:40:30.722-0400
:ID:       AHU-39
:CUSTOM_ID: AHU-39
:type-id:  10100
:END:
:LOGBOOK:
CLOCK: [2022-02-24 Thu 20:30]--[2022-02-24 Thu 20:35] =>  0:05
  :id: 10359
  Sample time clock entry
:END:
*** description: [[https://example.atlassian.net/browse/AHU-39][AHU-39]]
  The summary is here
*** Comment: Matthew Carter
:PROPERTIES:
:ID:       10680
:created:  2019-01-24T23:25:19.455-0500
:updated:  2019-01-24T23:27:36.125-0500
:END:

From: org-jira.

We can see the clock entry:
CLOCK: [2022-02-24 Thu 20:30]--[2022-02-24 Thu 20:35] => 0:05

And I would like to access the properties:
:id: 10359
and the comments:
Sample time clock entry

Is that possible?

Providing OrgEnv to load() with pathlib.Path errors

from pathlib import Path
fp = Path('mypath')
env = orgparse.OrgEnv(filename=fp)
root = orgparse.load(fp, env)

  File "/usr/lib/python3.10/site-packages/orgparse/__init__.py", line 142, in load
    return loadi(lines, filename=filename, env=env)
  File "/usr/lib/python3.10/site-packages/orgparse/__init__.py", line 162, in loadi
    return parse_lines(lines, filename=filename, env=env)
  File "/usr/lib/python3.10/site-packages/orgparse/node.py", line 1447, in parse_lines
    raise ValueError('If env is specified, filename must match')

Converting the path to str when creating the env works around the error:

from pathlib import Path
fp = Path('mypath')
env = orgparse.OrgEnv(filename=str(fp))
root = orgparse.load(fp, env)
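A likely explanation, assuming the check behind 'filename must match' is a plain equality comparison, is that a Path never compares equal to a str. The snippet shows that behaviour and one way to normalise both sides with os.fspath().

```python
import os
from pathlib import Path

fp = Path('mypath')

# A Path and a str are never equal, even when they name the same file.
path_equals_str = (fp == 'mypath')

# os.fspath() returns a str for both str and Path arguments,
# so comparing the normalised values behaves as expected.
normalised_equal = (os.fspath(fp) == os.fspath('mypath'))
```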

Non-existent date errors without context

2011-04-31 is not a valid date; April only has 30 days.

testcase:

** test
<2011-04-31 Sat>

leads to:

Traceback (most recent call last):
  File "/home/hrehfeld/projects/2023/topics/orgmode.py", line 31, in <module>
    doc = orgparse.load(filepath, make_env(filepath))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hrehfeld/projects/2023/topics/.venv/lib/python3.11/site-packages/orgparse/__init__.py", line 140, in load
    return load(orgfile, env)
           ^^^^^^^^^^^^^^^^^^
  File "/home/hrehfeld/projects/2023/topics/.venv/lib/python3.11/site-packages/orgparse/__init__.py", line 148, in load
    return loadi(all_lines, filename=filename, env=env)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hrehfeld/projects/2023/topics/.venv/lib/python3.11/site-packages/orgparse/__init__.py", line 168, in loadi
    return parse_lines(lines, filename=filename, env=env)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hrehfeld/projects/2023/topics/.venv/lib/python3.11/site-packages/orgparse/node.py", line 1464, in parse_lines
    node._parse_pre()
  File "/home/hrehfeld/projects/2023/topics/.venv/lib/python3.11/site-packages/orgparse/node.py", line 1151, in _parse_pre
    self._body_lines = list(ilines)
                       ^^^^^^^^^^^^
  File "/home/hrehfeld/projects/2023/topics/.venv/lib/python3.11/site-packages/orgparse/node.py", line 1202, in _iparse_timestamps
    self._timestamps.extend(OrgDate.list_from_str(l))
                            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hrehfeld/projects/2023/topics/.venv/lib/python3.11/site-packages/orgparse/date.py", line 471, in list_from_str
    odate = cls(
            ^^^^
  File "/home/hrehfeld/projects/2023/topics/.venv/lib/python3.11/site-packages/orgparse/date.py", line 227, in __init__
    self._start = self._to_date(start)
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/hrehfeld/projects/2023/topics/.venv/lib/python3.11/site-packages/orgparse/date.py", line 238, in _to_date
    return datetime.date(*date)
           ^^^^^^^^^^^^^^^^^^^^
ValueError: day is out of range for month

I'd expect orgparse either to parse the date somehow, or provide context where the error happens. This is probably as easy as augmenting the ValueError with location info.
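Augmenting the error could look roughly like the sketch below. to_date_with_context and its source_text and lineno parameters are hypothetical; they only illustrate re-raising the ValueError with the offending text attached.

```python
import datetime


def to_date_with_context(date_tuple, source_text, lineno):
    """Build a date, adding the offending text and line to any error."""
    try:
        return datetime.date(*date_tuple)
    except ValueError as error:
        raise ValueError(
            'invalid date %r on line %d: %s' % (source_text, lineno, error)
        ) from error


try:
    to_date_with_context((2011, 4, 31), '<2011-04-31 Sat>', lineno=2)
except ValueError as error:
    message = str(error)
else:
    message = None

valid = to_date_with_context((2011, 4, 30), '<2011-04-30 Sat>', lineno=2)
```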

Repeated_tasks and logbook parsing

It seems that orgparse puts the logbook drawer into the body.
If I set both a property drawer and a logbook drawer, the logbook drawer ends up in the body text.

Test this sample.

* DONE header 1 :blog:@Flat:template:
CLOSED: [2021-06-02 Wed 00:48]
:PROPERTIES:
:CREATED: [2021-05-25 Tue 00:33]
:TEMPLATE: test
:END:
:LOGBOOK:

State "DONE" from "SOMEDAY" [2021-06-02 Wed 00:48]
State "SOMEDAY" from "WAITING" [2021-06-02 Wed 00:48]
State "WAITING" from "STARTED" [2021-06-02 Wed 00:48]
:END:
Fish are aquatic, craniate, gill-bearing animals that lack limbs with digits.
next sentence

** DONE subheader 1 :template:blog:
CLOSED: [2021-05-25 Tue 00:51]
:PROPERTIES:
:CREATED: [2021-05-25 Tue 00:33]
:TEMPLATE: test
:END:
:LOGBOOK:

State "DONE" from "CANCELLED" [2021-05-25 Tue 00:51]
State "CANCELLED" from [2021-05-25 Tue 00:51] \
stopped
State "STARTED" from "WAITING" [2021-05-25 Tue 00:51]
State "WAITING" from "STARTED" [2021-05-25 Tue 00:37] \
time
:END:
The earliest organisms that can be classified as fish were soft-bodied chordates that first appeared during the Cambrian period.

*** DONE subsubheader :article:
CLOSED: [2021-05-25 Tue 00:51]
:PROPERTIES:
:CREATED: [2021-05-25 Tue 00:33]
:TEMPLATE: test
:END:
:LOGBOOK:

State "DONE" from "CANCELLED" [2021-05-25 Tue 00:51]
State "CANCELLED" from [2021-05-25 Tue 00:51] \
stopped
:END:
Most fish are ectothermic ("cold-blooded"), allowing their body temperatures to vary as ambient temperatures change, though some of the large active swimmers like white shark and tuna can hold a higher core temperature.

** subheader 2 :blog:
:PROPERTIES:
:CREATED: [2021-05-25 Tue 00:38]
:END:
Fish are an important resource for humans worldwide, especially as food.

Then run node.body.
It breaks on the double :END: and puts the logbook into the body.

QUESTION: Reason about using `codecs.open()`

orgparse/orgparse/__init__.py

Line 135 in 36b31d8

with codecs.open(str(path), encoding='utf8') as orgfile:

I am preparing a small bugfix and stumbled across this line, where you use codecs.open() instead of the built-in open() or pathlib.Path.open().

From your current point of view:

Is there a good, strict reason for this?
Are there any reasons against using the usual built-in open() or pathlib.Path.open()?

If there is no need for codecs, I will show you in a PR what I have in mind. ;)
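For reference, a minimal sketch of the alternatives mentioned above. read_org_text is a hypothetical helper; the built-in open() accepts both str and Path, so the str(path) conversion is not needed.

```python
import tempfile
from pathlib import Path


def read_org_text(path):
    """Read an org file as UTF-8 text with the built-in open()."""
    with open(path, encoding='utf8') as orgfile:
        return orgfile.read()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.org'
    sample.write_text('* Überschrift\n', encoding='utf8')

    from_str = read_org_text(str(sample))   # plain string path
    from_path = read_org_text(sample)       # pathlib.Path, no str() needed
    from_pathlib = sample.read_text(encoding='utf8')
```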

support names on tables

In an org-file like this, the table would have tabname as a name in the org-element. It would be helpful to have this information as attributes or something in orgparse.extra.Table.

#+caption: Some caption for a table
#+name: tabname
| x | y |
|---|---|
| 1 | 2 |

Support lists

Thanks for your recent work on orgparse, it looks promising.

It would be great if orgparse supported lists like this:

- one
- two
- three

I was able to port my code over from org-export-json to orgparse by parsing out the items from node.body myself, but I probably did a bad job of it, and that's exactly the sort of core org syntax thing that I'd love to be able to rely on orgparse for.
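The kind of hand-rolled extraction described above can be sketched as follows. RE_LIST_ITEM and list_items are hypothetical names, and the sketch covers only flat, unordered items starting with - or +; nested items, numbered lists and continuation lines are ignored.

```python
import re

# An unordered item: optional indent, '-' or '+', whitespace, then the text.
RE_LIST_ITEM = re.compile(r'^\s*[-+]\s+(.*)$')


def list_items(body):
    """Return the text of each flat unordered list item in `body`."""
    items = []
    for line in body.splitlines():
        match = RE_LIST_ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items


items = list_items('Some text\n- one\n- two\n- three\n')
```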

Logbook drawer tags are not removed from body text

This issue is also mentioned in #38, but it hasn't been resolved in the latest version, 36b31d8.

Test case (repeated task), modified from doctest:

>>> from orgparse import loads
>>> node = loads('''
... * TODO Pay the rent
...   DEADLINE: <2005-10-01 Sat +1m>
...   :LOGBOOK:
...   - State "DONE"  from "TODO"  [2005-09-01 Thu 16:10]
...   - State "DONE"  from "TODO"  [2005-08-01 Mon 19:44]
...   - State "DONE"  from "TODO"  [2005-07-01 Fri 17:27]
...   :END:
... ''').children[0]
>>> print(node.body)
  :LOGBOOK:
  :END:

Test case (clock):

>>> from orgparse import loads
>>> node = loads('''\
... * TODO Clock
... :LOGBOOK:
... CLOCK: [2022-01-01 Sat 00:00]--[2022-01-01 Sat 01:11] =>  1:11
... :END:
... ''').children[0]
>>> print(node.body)
:LOGBOOK:
:END:
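Until this is handled by the parser, leftover drawer lines can be filtered out of the body after the fact. The sketch below is a workaround under assumptions: strip_drawers is a hypothetical helper, and it treats any line of the form :NAME: followed later by :END: as a drawer.

```python
import re

RE_DRAWER_START = re.compile(r'^\s*:[A-Za-z0-9_-]+:\s*$')
RE_DRAWER_END = re.compile(r'^\s*:END:\s*$', re.IGNORECASE)


def strip_drawers(body):
    """Remove ':NAME:' ... ':END:' blocks from a body string."""
    kept = []
    in_drawer = False
    for line in body.splitlines():
        if in_drawer:
            # Drop everything up to and including the closing :END:.
            if RE_DRAWER_END.match(line):
                in_drawer = False
            continue
        if RE_DRAWER_START.match(line) and not RE_DRAWER_END.match(line):
            in_drawer = True
            continue
        kept.append(line)
    return '\n'.join(kept)


cleaned = strip_drawers('  :LOGBOOK:\n  :END:\n  Body text')
```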

Support tables, links etc

see https://www.reddit.com/r/orgmode/comments/d6v0mc/orgparse_is_back_python_library_for_reading/f0wxo14/

version missing

Not sure but it seems that there is no version information available.

>>> import orgparse
>>> orgparse.__version__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'orgparse' has no attribute '__version__'

I use 0.4.20231004 from PyPI, installed via pip.
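As a workaround that does not depend on a `__version__` attribute, the installed distribution's version can be read from its package metadata with the standard library (Python 3.8+). installed_version is a hypothetical wrapper that returns None when the distribution is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string when orgparse is installed, otherwise None.
print(installed_version('orgparse'))
```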

wrong parsing org file

I use

root = loads("""
* H1
** H2
*** H3
* H4
** H5
""")

for node in root[1]:
    print(node)

and get

* H1
** H2
*** H3
** H5

instead of

* H1
** H2
*** H3

H5 is not a child of H1, but it shows up in root[1]

Support writing to org files?

Is there any plan to support writing to org files? i.e. creating new org-nodes?

parsing of multiline properties

Currently orgparse is unable to parse multiline properties, i.e. properties where additional values for the same item are given on new lines, like this:

:PROPERTIES:
:item: value 1
:item+: same item value 2 
:another_item: another value

orgparse treats them as different items because it only handles single-line properties.
Would it be possible to add support for this?
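The requested behaviour can be sketched with a small stand-alone parser. RE_PROP and parse_properties are hypothetical names, and joining the appended value with a single space is an assumption about the desired result.

```python
import re

# ':key:' or ':key+:' followed by an optional value.
RE_PROP = re.compile(r'^\s*:([^:\s]+?)(\+)?:\s*(.*?)\s*$')


def parse_properties(lines):
    """Collect properties, appending ':key+:' values to ':key:'."""
    props = {}
    for line in lines:
        match = RE_PROP.match(line)
        if not match:
            continue
        key, plus, value = match.groups()
        if key in ('PROPERTIES', 'END'):
            continue  # drawer delimiters, not properties
        if plus and key in props:
            props[key] = props[key] + ' ' + value
        else:
            props[key] = value
    return props


props = parse_properties([
    ':PROPERTIES:',
    ':item: value 1',
    ':item+: same item value 2 ',
    ':another_item: another value',
])
```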

How do i load a file with custom TODO keys?

Hi,
i was wondering how to correctly parse an org file that contains custom TODO keys (e.g. ['TODO', 'WAITING', 'DONE', 'CANCELLED']). It seems supported as indicated by the add_todo_keys() method but looking into the file loading methods, those keys have to be set before actually loading the content and the API does not support that.

Am i missing something?

Thanks!

Scope, aim and future of orgparse

Hi!

Author of this primitive Org mode to Python3 parser speaking. I'd love to replace my stupid parser with a decent one in the future - if possible.

So: what is the goal of orgparse? What is the scope? What is the non-scope? What is the vision? I'd love to read such a small section in your readme file.

For example: will orgparse be able to parse all important Org mode syntax elements such as lists, tables, internal and external links, footnotes, text formatting (italic, underline, bold, ...), and so forth?

Currently, orgparse does seem to store the content of a heading without further analyzing it except various time- and date-stamps.

Support properties in OrgRootNode

I would like to extract properties from the OrgRootNode, a use case of this could be to find the ID of an Org Roam Article

Minor difference in date objects parsed

I found a difference between date object types returned when parsing nodes with only one line or multiple lines.

Eg:

from orgparse import loads
r0 = loads("""* Heading 1
* Heading 2
  body""")

print([c.scheduled for c in r0.children])  # prints [OrgDate(None), OrgDateScheduled(None)]

It's pretty minor and mostly cosmetic. I'm creating a PR to harmonize these a bit.

Date repeat

I'm trying to retrieve the repeat part of a scheduled date, but it is not clear whether it is saved during parsing. It seems the date regex takes care of that (the cookie part, from what I've understood), but then I get lost. Here is a simple example of what I would like to do:

import orgparse
node = orgparse.loads('''
* Pay the rent
  SCHEDULED: <2020-01-01 Wed +1m>
''').children[0]
time = node.scheduled
# repeat = node.???

Invalid syntax when trying to use the library

I just installed your library as documented using 'pip install orgparser'
and then tried to import 'load' and 'loads' from orgparser in a python shell.
It fails with the following syntax error message:

In [3]: from orgparse import load, loads
Traceback (most recent call last):

File "/usr/local/lib/python3.5/dist-packages/IPython/core/interactiveshell.py", line 3296, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 1, in
from orgparse import load, loads

File "/usr/local/lib/python3.5/dist-packages/orgparse/__init__.py", line 112, in
from .node import parse_lines, OrgNode # todo basenode??

File "/usr/local/lib/python3.5/dist-packages/orgparse/node.py", line 16
chunk: List[str] = []
^
SyntaxError: invalid syntax

Support more Effort format in Properties

The following code produces error in orgparse 0.2.3:

from orgparse import loads
root = loads('''
* Node
  :PROPERTIES:
  :Effort: 1:23:45
  :END:
''')
root.children[0].properties['Effort']

Error message:

./venv/lib/python3.6/site-packages/orgparse/node.py in parse_property(line)
    136     match = RE_PROP.search(line)
    137     if match:
--> 138         prop_key = match.group(1)
    139         prop_val = match.group(2)
    140         if prop_key == 'Effort':

ValueError: too many values to unpack (expected 2)

The other formats listed below result in similar errors:

1:23:45
1y 3d 3h 4min
1d3h5min
2.35h

The available effort formats are described in org-duration.el. I can parse these formats by adding some tests and modifying orgparse/node.py, lines 140-143. If supporting such formats is desirable, I can work on this in my free time and open a PR.
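A rough sketch of such a parser follows. effort_to_minutes is a hypothetical function, and the unit table is an assumption modelled on the usual org-duration defaults (a month counted as 30 days, a year as 365.25 days); it should be checked against org-duration.el before being relied on.

```python
import re

# Assumed minutes per unit.
UNIT_MINUTES = {
    'min': 1,
    'h': 60,
    'd': 60 * 24,
    'w': 60 * 24 * 7,
    'm': 60 * 24 * 30,
    'y': 60 * 24 * 365.25,
}

RE_CLOCK = re.compile(r'(\d+):(\d{2})(?::(\d{2}))?')
RE_UNIT_PART = re.compile(r'(\d+(?:\.\d+)?)\s*(min|h|d|w|m|y)')
RE_UNIT_FULL = re.compile(r'(?:\d+(?:\.\d+)?\s*(?:min|h|d|w|m|y)\s*)+')


def effort_to_minutes(text):
    """Convert 'H:MM', 'H:MM:SS' or unit-based efforts to minutes."""
    text = text.strip()
    clock = RE_CLOCK.fullmatch(text)
    if clock:
        hours, minutes, seconds = clock.groups()
        return int(hours) * 60 + int(minutes) + int(seconds or 0) / 60
    if RE_UNIT_FULL.fullmatch(text):
        return sum(float(value) * UNIT_MINUTES[unit]
                   for value, unit in RE_UNIT_PART.findall(text))
    raise ValueError('unsupported effort format: %r' % text)


minutes = effort_to_minutes('1d3h5min')
```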

Extracting line numbers from org nodes?

Is it possible to extract the line number from a node? I need this functionality because I want to signal to the user the document is invalid.

Bug in OrgDate.has_overlap

orgparse/orgparse/date.py

Line 319 in becddb1

elif self.start == other.get_start:

This line contains a bug left over from the change from the get_start() method to the start attribute (made in commit e2c964c): other.get_start is referenced but never called.
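The effect of comparing an attribute with an uncalled method can be reproduced in isolation. Span below is a hypothetical stand-in class, not the OrgDate implementation.

```python
class Span:
    """Minimal stand-in with both the old method and the new attribute."""

    def __init__(self, start):
        self._start = start

    @property
    def start(self):
        return self._start

    def get_start(self):
        return self._start


a, b = Span(1), Span(1)

# Compares an int with a bound method object: always False.
buggy = (a.start == b.get_start)

# Compares the two start values as intended.
fixed = (a.start == b.start)
```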