sefaria / sefaria-export


Structured Jewish texts and metadata exported from Sefaria's database.

License: Other


sefaria-export's Issues

CSV files?

You don't have the CSV files available. Could you put those in?

Also, this particular .csv can't be downloaded from sefaria:
https://www.sefaria.org/download/version/Rashi%20on%20Chullin%20-%20he%20-%20merged.csv

That particular link is from the Download Text section at the bottom of this page:
https://www.sefaria.org/Rashi_on_Chullin?lang=bi

I'm not sure if this was the place to put this...

Any plans to include BDB in the export?

I noticed you guys have what seems to be the most complete rendering of the BDB on the internet (in many respects, better than the copyrighted one Biblesoft has licensed to many popular websites). After necromancing around the depths of the interwebs, I found a thread where your CPO (@EliezerIsrael) pointed to bdb_parse as the fruit of a team at UTexas digitizing the public domain BDB printing.

Their project failed to get a grant renewal before the rough edges were polished. Specifically, many foreign languages (Aramaic, Arabic, Ethiopian, etc.) were left in an encoded state. The Hebrew itself seemed to have an issue with ordering and encoding het and tsade.

After just glancing through your final results, I can see you seem to have fixed all of the above issues, while apparently removing some details encoded in the original (e.g., the original language of a given word). I can tell a lot of work occurred between the final results of your public repository Sefaria-BDB and the content displayed on the site.

I also noticed that these final results were excluded from the large dump at Sefaria-Export. Given how much effort went into its production, I am unsure if this is intentional. That being said, I figured I'd give it a shot and politely request that it be included in the export, if possible.

Thanks a ton,

Julian Wagle

XML Pull Request

Can you export the XML data?
Thanks!

Tanakh/Torah organized by weekly parsha?

When looking at the JSON file representation of the 5 books, I'm unable to decipher how these are structured. It doesn't appear that these arrays are broken down by what the standard weekly parsha reading would be.

How are they broken down?

Is there any way to generate array items that would group the text by lines in the parsha?
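
For what it's worth, here is a minimal Python sketch of one way to group verses by parsha. It assumes (not confirmed in this thread) that each book's JSON has a "text" field holding a list of chapters, each of which is a list of verse strings, and that you supply the parsha boundaries yourself. The names slice_verses, toy_text and toy_parshiyot are made up for illustration.

```python
from typing import Dict, List, Tuple

Ref = Tuple[int, int]  # (chapter, verse), both 1-based


def slice_verses(text: List[List[str]], start: Ref, end: Ref) -> List[str]:
    """Return the verses from start to end (inclusive) in reading order."""
    out = []
    for ch_num, chapter in enumerate(text, start=1):
        for v_num, verse in enumerate(chapter, start=1):
            # Tuples compare lexicographically: chapter first, then verse.
            if start <= (ch_num, v_num) <= end:
                out.append(verse)
    return out


# Toy stand-in for book["text"]; a real file would be read with json.load.
toy_text = [["1:1", "1:2", "1:3"], ["2:1", "2:2"], ["3:1"]]

# Hypothetical boundary table: parsha name -> (start ref, end ref).
toy_parshiyot: Dict[str, Tuple[Ref, Ref]] = {
    "first": ((1, 1), (2, 1)),
    "second": ((2, 2), (3, 1)),
}

grouped = {
    name: slice_verses(toy_text, start, end)
    for name, (start, end) in toy_parshiyot.items()
}
```

With real data you would fill the boundary table from a list of parsha ranges (for example, Bereshit runs from Genesis 1:1 to 6:8).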

Versions of the Talmud

I am doing some research on Talmud text, using the excellent edition of William Davidson. I have several questions:

What is the "merged" version?
Some tractates have no William Davidson edition; are you planning to add these tractates in the near future?

Thanks!!!

Siddur License

Hi! We are curious whether we can use your siddur commercially. What is the license?

Joshua/English/The Koren Jerusalem Bible.json missing translations for 21:36-37

Basically what it says on the tin. The JSON file just has two sets of "" where the translations should be.

Letter counts for the Torah appear to be incorrect according to several sources

According to my counts:
total: 305172 (not including items in [] in the raw .txt files)

א: 27069
ב: 16357
ג: 2111
ד: 7039
ה: 28085
ו: 30596
ז: 2199
ח: 7194
ט: 1806
י: 31607
ך: 3358
כ: 8614
ל: 21583
ם: 10630
מ: 14474
ן: 4260
נ: 9889
ס: 1836
ע: 11270
ף: 830
פ: 3976
ץ: 1035
צ: 2937
ק: 4700
ר: 18147
ש: 15605
ת: 17965

https://www.torahmusings.com/2012/10/letters-of-the-torah/
https://nifla-ot.co.il/2016/01/27/%D7%90%D7%95%D7%AA%D7%99%D7%95%D7%AA-%D7%91%D7%AA%D7%A0%D7%9A/
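
For anyone who wants to reproduce a count like the one above, here is a minimal Python sketch. It counts only the 27 letter forms listed (22 letters plus 5 final forms), so vowel points, cantillation marks and maqaf are ignored, and it optionally drops anything inside [] first. The function name count_hebrew_letters is my own, and how the raw .txt files should be read is left to the caller.

```python
import re
from collections import Counter

# The 22 Hebrew letters plus the 5 final forms.
HEBREW_LETTERS = set("אבגדהוזחטיךכלםמןנסעףפץצקרשת")


def count_hebrew_letters(text: str, skip_bracketed: bool = True) -> Counter:
    """Count Hebrew letters, optionally ignoring text inside [ ]."""
    if skip_bracketed:
        text = re.sub(r"\[[^\]]*\]", "", text)
    return Counter(ch for ch in text if ch in HEBREW_LETTERS)


sample = "בראשית [בר] ברא"
counts = count_hebrew_letters(sample)
total = sum(counts.values())
```

Summing counts.values() over all five books would give a total to compare against the figures in the linked sources.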

How were these files created?

Are links directed or undirected?

This analysis: https://lizshayne.wordpress.com/2014/06/17/sefaria_in_gephi/ interprets links as having a meaningful directionality. However, I could not find the documentation for this, and some examples suggest that the order of citation 1/citation 2 may be arbitrary, so that the data actually describes an undirected graph. Can you confirm?

Is this Git repository still active?

Inconsistent escape characters

In the json directory, each line of Mishnah (and of several other sources) ends with a \n escape character. However, in some places (e.g. Mishnah Tahorot/Hebrew/merged.json) some chapters are missing the escape character.
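
Until the files are made consistent, a consumer can normalize on its own side. The sketch below assumes the loaded text is a nested list of strings (chapters containing lines), which is my assumption about the layout; both function names are hypothetical.

```python
def strip_trailing_newlines(node):
    """Recursively remove trailing newlines from every string in a nested list."""
    if isinstance(node, str):
        return node.rstrip("\n")
    return [strip_trailing_newlines(child) for child in node]


def chapters_missing_newline(chapters):
    """Return 0-based indices of chapters where a non-empty line lacks a trailing newline."""
    return [
        i
        for i, chapter in enumerate(chapters)
        if any(line and not line.endswith("\n") for line in chapter)
    ]
```

The first function makes downstream code indifferent to the inconsistency; the second can be used to report which chapters are affected.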

Is the Zohar on sefaria.org the same as the JSON ones in this export?

I am looking at https://www.sefaria.org/Zohar%2C_Introduction.3.8?lang=bi&with=Translations&lang2=en and trying to find the Hebrew text in the JSON, but I don't think it is there. Which version of the Hebrew Zohar is on Sefaria.org, and is it to be found in the export somewhere?

For example, this is all it found:

From this on the site:

Are the .txt files different from the JSON?

Texts not using appropriate Unicode punctuation

Some Sefaria texts are using similar-shaped glyphs when more semantically-appropriate glyphs exist in the unicode standard:

e.g.
U+0022 QUOTATION MARK in place of U+05F4 HEBREW PUNCTUATION GERSHAYIM
U+0027 APOSTROPHE in place of U+05F3 HEBREW PUNCTUATION GERESH
U+003A COLON in place of U+05C3 HEBREW PUNCTUATION SOF PASUQ

see as an example https://raw.githubusercontent.com/Sefaria/Sefaria-Export/master/cltk-flat/Midrash/Aggadic%20Midrash/Midrash%20Tehillim/Hebrew/merged.json
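
A rough Python sketch of how such a substitution could be done on the consumer side. The rules are heuristics of my own (a quotation mark between two Hebrew letters, an apostrophe right after a Hebrew letter, a colon right after a Hebrew letter and before whitespace or end of string); they will misfire on genuine quotations or colons in some texts, so treat this as a starting point only.

```python
import re

HEB = "\u05d0-\u05ea"  # Hebrew letters alef..tav, for use inside a character class


def normalize_hebrew_punctuation(s: str) -> str:
    """Heuristically swap ASCII look-alikes for Hebrew punctuation marks."""
    # " between two Hebrew letters -> U+05F4 GERSHAYIM
    s = re.sub(rf'(?<=[{HEB}])"(?=[{HEB}])', "\u05f4", s)
    # ' directly after a Hebrew letter -> U+05F3 GERESH
    s = re.sub(rf"(?<=[{HEB}])'", "\u05f3", s)
    # : after a Hebrew letter, before whitespace or end -> U+05C3 SOF PASUQ
    s = re.sub(rf"(?<=[{HEB}]):(?=\s|$)", "\u05c3", s)
    return s
```

Text without Hebrew letters next to the punctuation is left untouched, so English passages in the same file should pass through unchanged.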

Smaller data dump is corrupted

I have tried multiple times to download the data without text history and I cannot extract the tarball. Unexpected EOF. Please verify that file is healthy.
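
To tell a truncated download apart from a problem with the local extraction tool, one can read the archive end to end with Python's standard tarfile module. This is only a sketch; the function name tarball_is_readable is made up, and it says nothing about why a given file is broken.

```python
import tarfile
import zlib


def tarball_is_readable(path: str) -> bool:
    """Return True if every member of the archive can be read to the end."""
    try:
        with tarfile.open(path, "r:*") as tar:  # "r:*" auto-detects compression
            for member in tar:
                if member.isfile():
                    fh = tar.extractfile(member)
                    # Read in 1 MiB chunks so large members do not fill memory.
                    while fh.read(1 << 20):
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
```

If this returns False for a freshly downloaded file, comparing its size with the size reported by the server is a reasonable next step.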

Converting output to well known book formats

Has anyone created software that can convert one of Sefaria's formats (say, JSON) to an e-book format like EPUB, MOBI, etc.?
If not, is there any documentation of Sefaria's formats, so I can create one myself?

May you be blessed from Heaven

Thanks. Please regenerate export to include Chullin and Bechoros.

Double or split of Midrash Rabbah - Bereshit Lviv, 1874

In Bereishit Rabbah/Hebrew there's a file named " Midrash Rabbah - Bereshit Lviv, 1874.json" with a space at the beginning of the file name, and a smaller file with the same name but without a space at the beginning. The larger one contains chapter 67 (Vayetze), and the smaller one contains chapter 23 (Toledot)...

Download does not include vocalised text for tractate Pesachim in Talmud Bavli

On the sefaria website I am able to access vocalised text for the Pesachim tractate of the Talmud Bavli but the download doesn't seem to include it. On closer inspection I noticed that this is the case for a number of tractates. Are there any plans to update and include the vocalised versions?

Meaning of HTML in the merged.json files?

What is the meaning of <big> and stuff like ?

Duplicates leading to large repository sizes

cltk-flat and cltk-full seem to duplicate a lot of the content from the json directory. Each one of these directories is 4.1GB, meaning that a git clone operation is extremely slow and requires a lot of disk. (Sparse clone is theoretically possible but very fiddly to set up and very slow to execute, and it has problems with the number of files in the schema directory.)

Would it be possible to do one of the following?

Put the cltk* material in a separate git repository
Have a helper script that re-builds the cltk* based on the information in the json directory if needed
Have a helper script that downloads the cltk* from an FTP site if needed
Have the cltk* file trees use symlinks to, rather than duplicating the files from, json
Refactor the code not to need largely-redundant file trees

How to map references in links to specific text fragment?

The link files contain citations in the following format:

"A New Israeli Commentary on Pirkei Avot 1:10:13" -> "Sanhedrin 4a"

The table also contains book names and these are easy to identify and find in the repo.

However, it is not clear how to easily find the fragments referred to in the citation columns without writing custom parsers.

For example, Sanhedrin 4a - there isn't anything (like an index) in the json (the same applies to the other formats) structure of the Sanhedrin file to find the text extract itself.

Moreover, even if I were to write custom parsers for these references, they only point to the beginning of the extract and not the end.

On the other hand, the Sefaria application successfully maps all references (and the websites too obviously). What am I missing?
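
In case it helps others with the same question, here is a small Python sketch of a parser for the two citation shapes shown above. The index arithmetic is an assumption on my part: numeric addresses like 1:10:13 are treated as 1-based positions in nested arrays, and a daf/amud like 4a is mapped to a single 0-based index on the assumption that the array reserves two slots for the non-existent daf 1. Ranges are not handled, and parse_ref and lookup are hypothetical names, not Sefaria's API.

```python
import re
from typing import List, Tuple


def parse_ref(ref: str) -> Tuple[str, List[int]]:
    """Split a citation into (title, list of 0-based indices)."""
    title, _, address = ref.rpartition(" ")
    daf = re.fullmatch(r"(\d+)([ab])", address)
    if daf:
        # ASSUMPTION: index 0 and 1 stand for daf 1a/1b, so 2a -> 2, 4a -> 6.
        number, amud = int(daf.group(1)), daf.group(2)
        return title, [2 * (number - 1) + (0 if amud == "a" else 1)]
    if re.fullmatch(r"\d+(:\d+)*", address):
        # ASSUMPTION: each number is a 1-based position at one nesting level.
        return title, [int(n) - 1 for n in address.split(":")]
    raise ValueError(f"unsupported reference: {ref!r}")


def lookup(text, indices: List[int]):
    """Walk a nested list using the indices returned by parse_ref."""
    node = text
    for i in indices:
        node = node[i]
    return node
```

For example, parse_ref("Sanhedrin 4a") gives the title "Sanhedrin" and a single index, which can then be passed to lookup together with that book's text array.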

Which English version of Genesis in Sefaria is the best?

I asked this on the Judaism StackExchange: https://judaism.stackexchange.com/questions/134282/what-is-the-best-free-version-of-the-book-of-genesis-in-english-which-is-comple

To summarize, I am just looking for Genesis in English, to begin my search for quality open source English translations of the Torah (and potentially Tanakh or larger books in the future, but Torah is pretty universal). However, it appears that not even the Genesis JSON in Sefaria is complete or accurate. Can you briefly outline the state of affairs when it comes to English translations in JSON (or another format), particularly for Genesis?

For Genesis in particular, what is the best English translation you have? Are you planning on adding more? If you don't have a quality one, do you know of a quality open source (free) one online somewhere, by chance?

Thank you for all you've done, what an invaluable resource Sefaria is so far!

Can you clarify the licenses of this content?

This is very cool content! Your README states:

Structured Jewish texts and metadata with free public licenses, exported from Sefaria's database.

Can you clarify what those licenses are? I imagine that different licenses affect how people can use this information.

Thanks!

Can't checkout repository on Windows

Hi,

I'm not sure if this error happens on other operating systems, but on Windows a checkout fails, since one of the folder names contains characters that are invalid in a Windows file/folder name:

fatal: cannot create directory at 'cltk-flat/Modern Works/Works of Eliezer Berkovits/Conversion "According to Halakhah"; What Is It': Invalid argument
warning: Clone succeeded, but checkout failed.

Windows does not allow quotes in a file name.

Where is the NJPS translation of Tanakh?

I was looking for this specific translation, which I know exists on Sefaria's website, but couldn't find it in this repo (see here, for an example of this translation being used).

Thanks for all your hard work! Our program, Tanach Study has benefited tremendously from your texts.

database download link is not valid

Also, Dropbox might not be the greatest option. It used to be that I had to add the file to my Dropbox account before I could download it, and it was too big for the free account. I think a Google Drive or S3 link might be a better option.

Please update sefaria export

It seems it hasn't been updated for over a year now.

Generating and publishing statistics e.g. size of various texts

I've looked in various places for a sense of how large each of the Sefaria texts is, e.g. as listed on the main page, with no luck. Is there a page that summarizes that, or a tool to process the export data to calculate sizes in terms of word counts, pages, or whatever is common in this realm?

I guess this might end up on the Metrics page: https://www.sefaria.org/metrics

Thanks for this great project!
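
As a stopgap for the word-count part of this request, something like the following Python sketch could be run over the export. It assumes the text of a work is a nested list of strings (however deep) and counts whitespace-separated tokens, which is a crude measure; count_words is a hypothetical helper name.

```python
def count_words(node) -> int:
    """Count whitespace-separated tokens in a string or nested list of strings."""
    if isinstance(node, str):
        return len(node.split())
    return sum(count_words(child) for child in node)


# Toy example: two chapters, one empty segment.
toy_text = [["a b", "c"], ["", "d e f"]]
toy_total = count_words(toy_text)
```

Markup inside the strings would be counted as words too, so stripping tags first gives a cleaner number.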

Difference between Sefaria app and Sefaria-Export

Hello,

I use the Sefaria app, and I noticed that there are differences between the text there and the text in this export.

For example, in the app, Genesis 2:10 it says:

"A river issues from Eden to water the garden, and it then divides and becomes four branches."

But, from the merged.json [and merged.txt], it says:

"And a river went out of Eden to water the garden; and from thence it was parted, and became four heads."

Can you kindly explain why this is so? What can be done to normalize this?

ERROR FOUND in Malachi, Lamentations, Ecclesiastes.json texts

Hi,

We are developing TorahBibleCodes Equidistant Letter Sequences (ELS) Search Software based upon the Tanach texts that you have provided:
https://github.com/TorahBibleCodes/TorahBibleCodes

We have forked your texts, and have found an error caused by unnecessary HTML tags included in the JSON Hebrew text of Malachi, on the last line here:

https://github.com/Sefaria/Sefaria-Export/blob/master/json/Tanakh/Prophets/Malachi/Hebrew/Tanach%20with%20Text%20Only.json

Here is a copy of the line:
"והשיב לב־אבות על־בנים ולב בנים על־אבותם פן־אבוא והכיתי את־הארץ חרם [הנה אנכי שלח לכם את אליה הנביא לפני בוא יום יהוה הגדול והנורא]"

We will remove these on our local copy of your texts that are together with our program, but are concerned that those who choose to fork and/or clone your texts directly will encounter this error/bug when running our program using your texts until you are able to correct these errors.

Thank you.

TorahBibleCodes.com
https://github.com/TorahBibleCodes
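
For others who hit the same problem, a local cleanup of the kind described above could look like this minimal Python sketch. It removes anything that looks like an HTML tag and decodes entities, but keeps the text between the tags; whether that enclosed text should also be dropped depends on the use case. strip_tags is a hypothetical helper, and a simple regex like this is not a full HTML parser.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")


def strip_tags(s: str) -> str:
    """Remove HTML-like tags and decode entities, keeping the enclosed text."""
    return html.unescape(_TAG.sub("", s))
```

Applying it to every verse string before running a letter-based search avoids having the tag names counted as text.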

Do you have a French translation of Tanakh or at least Houmach?

Mapping headers in the text files

Is there any file that maps the titles in the books?
[I did not see any markings in the body of the text]

What do these formatting styles in Hebrew texts from Sefaria mean?

I am cross-posting a question I posted on the Literature StackExchange: "What do these formatting styles in Hebrew texts from Sefaria mean?" Any help would be appreciated. Thank you.

Invalid File Paths Prevent Cloning on Certain Systems

Description:

When attempting to clone the repository on my system, I encountered an error due to an invalid file path. The specific error message received was:

error: invalid path 'cltk-flat/Jewish Thought/Modern/Eliezer Berkovits/Conversion "According to Halakhah"; What Is It/English/Conversion According to Halakhah - What Is It Judaism 23 Fall 1974 467-78.json'

The issue appears to be caused by the quotation marks and semicolons in the file path, as these characters are not allowed in file names on certain systems.

Steps to Reproduce:

Attempt to clone the repository using the command: git clone https://github.com/Sefaria/Sefaria-Export.git

Expected Behavior:

The repository should be cloned without any errors, regardless of the system used.

Actual Behavior:

The cloning process fails due to an invalid file path error.

Additional Information:

Operating System: Win10

Possible Solution:

Consider renaming files or folders in the repository to avoid using special characters such as quotation marks or semicolons, which are not compatible with certain file systems.
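
To find every affected path rather than just the first one Git reports, a small Python sketch like this can be run over a list of repository paths (for example, the output of a file listing). It checks each path component against the characters Windows forbids in names (< > : " \ | ? * and control characters) and against names ending in a space or dot. Note that a semicolon on its own is accepted by Windows, so the quotation marks are the likely culprit here. windows_unsafe_paths is a hypothetical helper name.

```python
import re
from typing import Iterable, List

# Characters Windows rejects inside a single file or folder name.
_FORBIDDEN = re.compile(r'[<>:"\\|?*\x00-\x1f]')


def windows_unsafe_paths(paths: Iterable[str]) -> List[str]:
    """Return the '/'-separated paths that cannot be created on Windows."""
    bad = []
    for path in paths:
        for part in path.split("/"):
            # Names ending in a space or a dot are also rejected.
            if _FORBIDDEN.search(part) or part != part.rstrip(" ."):
                bad.append(path)
                break
    return bad
```

Reserved device names such as CON or NUL are not checked in this sketch.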

Add Jastrow here

Hi there, I just wanted to request the Jastrow dictionary as txt or JSON in this repo.
Todah Rabbah.

Jastrow dictionary

Is the Jastrow dictionary available among your structured data? I am not able to find it... Thank you very much in advance.

Incorrect syntax for restoring the MongoDB dump

The correct syntax is:
mongorestore --drop -d <database-name> <directory-of-dumped-backup>
Example: $ mongorestore --drop -d sefaria sefaria
Important note about the directory: this assumes you are inside the dump directory. If not, pass dump/sefaria instead.

Path collision: Case-sensitive files on a case-insensitive filesystem (APFS)

There are a handful of JSON files that have 2 versions–one with an uppercased letter and one without. Is this intentional?

The macOS APFS filesystem, while capable of case sensitivity, is almost always configured to be case-insensitive. "sefer" is a popular collision.

warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

  'cltk-flat/Halakhah/Sefer HaChinukh/Hebrew/Sefer HaChinukh -- Torat Emet.json'
  'cltk-flat/Halakhah/Sefer HaChinukh/Hebrew/sefer HaChinukh -- Torat Emet.json'
  'cltk-flat/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna Edition.json'
  'cltk-flat/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna edition.json'
  'cltk-flat/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim On Jonah--Wikisource.json'
  'cltk-flat/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim on Jonah--Wikisource.json'
  'cltk-full/Halakhah/Sefer HaChinukh/Hebrew/Sefer HaChinukh -- Torat Emet.json'
  'cltk-full/Halakhah/Sefer HaChinukh/Hebrew/sefer HaChinukh -- Torat Emet.json'
  'cltk-full/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna Edition.json'
  'cltk-full/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna edition.json'
  'cltk-full/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim On Jonah--Wikisource.json'
  'cltk-full/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim on Jonah--Wikisource.json'
  'json/Halakhah/Sefer HaChinukh/Hebrew/Sefer HaChinukh -- Torat Emet.json'
  'json/Halakhah/Sefer HaChinukh/Hebrew/sefer HaChinukh -- Torat Emet.json'
  'json/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna Edition.json'
  'json/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna edition.json'
  'json/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim On Jonah--Wikisource.json'
  'json/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim on Jonah--Wikisource.json'
  'txt/Halakhah/Sefer HaChinukh/Hebrew/Sefer HaChinukh -- Torat Emet.txt'
  'txt/Halakhah/Sefer HaChinukh/Hebrew/sefer HaChinukh -- Torat Emet.txt'
  'txt/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna Edition.txt'
  'txt/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna edition.txt'
  'txt/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim On Jonah--Wikisource.txt'
  'txt/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim on Jonah--Wikisource.txt'

Clone succeeded, but checkout failed.
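
A check for this kind of collision can be sketched in a few lines of Python: group paths by their case-folded form and report every group with more than one member. This only approximates what a case-insensitive filesystem does (it ignores Unicode normalization, for instance), and case_collisions is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Iterable, List


def case_collisions(paths: Iterable[str]) -> List[List[str]]:
    """Return sorted groups of distinct paths that differ only by letter case."""
    groups = defaultdict(set)
    for path in paths:
        groups[path.casefold()].add(path)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

Running something like this before each export would catch the duplicates before they reach the repository.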

Add manual update info to README

From e.g. #9 and #32, it's apparently not obvious from the README that this repository is manually and intermittently updated.

It would be nice if the README made that clearer, along with a "last update" date on the README itself and not just in the commit history.

Maybe this can be done on the next update - it looks like it's about time 😉

MongoDB restore fails

On running the MongoDB restore from the dump (smaller version), I get:

Failed: sefaria.webpages: error creating indexes for sefaria.webpages: createIndex error: WiredTigerIndex::insert: key too large to index, failing 1149 { : "https://yutorah.org/daf.cfm/6040/taanit/2/a/static/rand=0.9185715476199539&iit=1635769073858&tmr=load=1635769073708&core=1635769073742&amp..." }

This is apparently caused by trying to index a string that is too long?