sefaria-export's People

Contributors

blockspeiser, citelao, edamboritz, eliezerisrael, herzberg, nealmcb, ngwi, nsantacruz, rneiss, sefariabot

sefaria-export's Issues

Letter counts for the Torah appear to be incorrect according to several sources

According to my counts:
total: 305172 (not including items in [] in the raw .txt files)

א: 27069
ב: 16357
ג: 2111
ד: 7039
ה: 28085
ו: 30596
ז: 2199
ח: 7194
ט: 1806
י: 31607
ך: 3358
כ: 8614
ל: 21583
ם: 10630
מ: 14474
ן: 4260
נ: 9889
ס: 1836
ע: 11270
ף: 830
פ: 3976
ץ: 1035
צ: 2937
ק: 4700
ר: 18147
ש: 15605
ת: 17965

https://www.torahmusings.com/2012/10/letters-of-the-torah/
https://nifla-ot.co.il/2016/01/27/%D7%90%D7%95%D7%AA%D7%99%D7%95%D7%AA-%D7%91%D7%AA%D7%A0%D7%9A/

How were these files created?
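
For comparison with the published figures, a tally like the one above can be reproduced with a short Python sketch. It assumes the raw .txt content has already been read into a string; the function name is illustrative and not part of the repository. Bracketed spans are dropped before counting, and vowel points and cantillation marks are ignored because only the 27 base and final letter forms are tallied.

```python
import re
from collections import Counter

# The 22 Hebrew letters plus the 5 final forms, as listed in the counts above.
HEBREW_LETTERS = "אבגדהוזחטיךכלםמןנסעףפץצקרשת"


def count_hebrew_letters(text: str) -> Counter:
    """Count Hebrew letters in text, skipping anything inside [...]."""
    # Remove bracketed spans first (hypothetical handling of the [] items).
    without_brackets = re.sub(r"\[[^\]]*\]", "", text)
    return Counter(ch for ch in without_brackets if ch in HEBREW_LETTERS)
```

Summing the values of the returned counter gives the total, which can then be checked against the sources linked above.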

Invalid File Paths Prevent Cloning on Certain Systems

Description:

When attempting to clone the repository on my system, I encountered an error due to an invalid file path. The specific error message received was:

error: invalid path 'cltk-flat/Jewish Thought/Modern/Eliezer Berkovits/Conversion "According to Halakhah"; What Is It/English/Conversion According to Halakhah - What Is It Judaism 23 Fall 1974 467-78.json'

The issue appears to be caused by the quotation marks and semicolons in the file path, as these characters are not allowed in file names on certain systems.

Steps to Reproduce:

Attempt to clone the repository using the command: git clone https://github.com/Sefaria/Sefaria-Export.git

Expected Behavior:

The repository should be cloned without any errors, regardless of the system used.

Actual Behavior:

The cloning process fails due to an invalid file path error.

Additional Information:

Operating System: Win10

Possible Solution:

Consider renaming files or folders in the repository to avoid using special characters such as quotation marks or semicolons, which are not compatible with certain file systems.
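
A check along these lines could flag problem paths before they are committed. This is a minimal sketch, not an existing tool in the repository; the function name is hypothetical. It tests each path component against the characters Windows reserves in file names (`< > : " | ? *` and control characters). The semicolon is not in that set, so in the path above the double quotes alone would be enough to trigger the error.

```python
# Characters that Windows does not allow inside a file or folder name.
# "/" and "\" are also reserved, but here "/" is the separator we split on.
WINDOWS_RESERVED = set('<>:"|?*')


def invalid_components(path: str) -> list:
    """Return the components of a '/'-separated path that Windows would reject."""
    bad = []
    for part in path.split("/"):
        if any(ch in WINDOWS_RESERVED or ord(ch) < 32 for ch in part):
            bad.append(part)
    return bad
```

Running it over the output of `git ls-files` would list every offending folder or file name in one pass.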

Versions of the Talmud

I am doing some research on Talmud text, using the excellent edition of William Davidson. I have several questions:

  • What is the "merged" version?
  • Some tractates have no William Davidson edition; are you planning to add these tractates in the near future?

Thanks!!!

Incorrect syntax for restoring the MongoDB dump

The correct syntax is:
mongorestore --drop -d <database-name> <directory-of-dumped-backup>
Example: $ mongorestore --drop -d sepharia sepharia
Important note about the directory: this assumes you are inside the dump directory. If you are not, use dump/sepharia as the path.

How to map references in links to specific text fragment?

The link files contain citations in the following format:

"A New Israeli Commentary on Pirkei Avot 1:10:13" -> "Sanhedrin 4a"

The table also contains book names and these are easy to identify and find in the repo.

However, it is not clear how to easily find the fragments referred to in the citation columns without writing custom parsers.

For example, take Sanhedrin 4a: there is nothing like an index in the structure of the Sanhedrin JSON file (the same applies to the other formats) that would locate the text extract itself.

Moreover, even if I were to write custom parsers for these references, they only point to the beginning of the extract and not the end.

On the other hand, the Sefaria application successfully maps all references (as does the website, obviously). What am I missing?
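
One possible approach, sketched here under assumptions rather than taken from Sefaria's own code, is to treat the trailing part of a reference as a list of positions in the nested text array. The sketch assumes that colon-separated numbers are 1-based section indices and that Talmud pages are stored two slots per daf starting at 1a (so 4a would be slot 6, counting from zero). Both function names are hypothetical, and ranged references such as "1:2-3" are not handled.

```python
import re


def parse_ref(ref: str):
    """Split 'Sanhedrin 4a' or 'Genesis 2:10' into (title, zero-based indices)."""
    title, _, address = ref.rpartition(" ")
    indices = []
    for part in address.split(":"):
        daf = re.fullmatch(r"(\d+)([ab])", part)
        if daf:
            # Assumed layout: slot 0 is 1a, slot 1 is 1b, slot 2 is 2a, and so on.
            page, side = int(daf.group(1)), daf.group(2)
            indices.append((page - 1) * 2 + (1 if side == "b" else 0))
        else:
            indices.append(int(part) - 1)
    return title, indices


def lookup(text, indices):
    """Walk a nested list using the indices produced by parse_ref."""
    node = text
    for i in indices:
        node = node[i]
    return node
```

The title returned by `parse_ref` identifies the file, and `lookup` is then applied to that file's text array. If the assumed layout does not match a given work, only the index arithmetic needs to change.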

Can you clarify the licenses of this content?

This is very cool content! Your README states:

Structured Jewish texts and metadata with free public licenses, exported from Sefaria's database.

Can you clarify what those licenses are? I imagine that different licenses affect how people can use this information.

Thanks!

ERROR FOUND in Malachi, Lamentations, Ecclesiastes.json texts

Hi,

We are developing TorahBibleCodes Equidistant Letter Sequences (ELS) Search Software based upon the Tanach texts that you have provided:
https://github.com/TorahBibleCodes/TorahBibleCodes

We have forked your texts and have found an error caused by unnecessary HTML tags included in the JSON Hebrew text of Malachi, on the last line here:

https://github.com/Sefaria/Sefaria-Export/blob/master/json/Tanakh/Prophets/Malachi/Hebrew/Tanach%20with%20Text%20Only.json

Here is a copy of the line:
"והשיב לב־אבות על־בנים ולב בנים על־אבותם פן־אבוא והכיתי את־הארץ חרם<br><small>[הנה אנכי שלח לכם את אליה הנביא לפני בוא יום יהוה הגדול והנורא]</small>"

We will remove these on our local copy of your texts that are together with our program, but are concerned that those who choose to fork and/or clone your texts directly will encounter this error/bug when running our program using your texts until you are able to correct these errors.

Thank you.

TorahBibleCodes.com
https://github.com/TorahBibleCodes
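
Until the source files are corrected, a local cleanup of the kind described above could look like the following sketch. It assumes that everything inside `<small>...</small>` is an editorial addition that should be dropped entirely, and that any remaining tags (such as `<br>`) can simply be removed; the function name is illustrative.

```python
import re


def strip_markup(verse: str) -> str:
    """Remove <small>...</small> blocks and any leftover HTML tags from a verse."""
    # Drop the whole <small> element, including its bracketed contents.
    verse = re.sub(r"<small>.*?</small>", "", verse, flags=re.DOTALL)
    # Drop any remaining tags such as <br>, keeping the surrounding text.
    verse = re.sub(r"<[^>]+>", "", verse)
    return verse.strip()
```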

MongoDB restore fails

On running the MongoDB restore from the dump (smaller version), I get:

Failed: sefaria.webpages: error creating indexes for sefaria.webpages: createIndex error: WiredTigerIndex::insert: key too large to index, failing 1149 { : "https://yutorah.org/daf.cfm/6040/taanit/2/a/static/rand=0.9185715476199539&amp;iit=1635769073858&amp;tmr=load=1635769073708&amp;core=1635769073742&amp..." }

This appears to be caused by trying to index a string that is too long.

Tanakh/Torah organized by weekly parsha?

When looking at the JSON file representation of the 5 books, I'm unable to decipher how these are structured. It doesn't appear that these arrays are broken down by what the standard weekly parsha reading would be.

How are they broken down?

Is there any way to generate array items that would group the text by lines in the parsha?
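
If the arrays are indexed by chapter and then by verse, which is an assumption about the export layout, a parsha can be extracted by slicing between its start and end references. The sketch below takes those boundaries as input; they would have to come from a separate table, for example Bereshit as Genesis 1:1 to 6:8. The function name is hypothetical.

```python
def verses_in_range(text, start, end):
    """Collect verses between two 1-based (chapter, verse) points, inclusive.

    text is assumed to be a list of chapters, each a list of verse strings.
    """
    (start_ch, start_v), (end_ch, end_v) = start, end
    verses = []
    for ch in range(start_ch, end_ch + 1):
        chapter = text[ch - 1]
        first = start_v if ch == start_ch else 1
        last = end_v if ch == end_ch else len(chapter)
        verses.extend(chapter[first - 1:last])
    return verses
```

With the Genesis text array loaded, `verses_in_range(text, (1, 1), (6, 8))` would return the verses for that reading as a flat list.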

Any plans to include BDB in the export?

I noticed you guys have what seems to be the most complete rendering of the BDB on the internet (in many respects, better than the copyrighted one Biblesoft has licensed to many popular websites). After necromancing around the depths of the interwebs, I found a thread where your CPO (@EliezerIsrael) pointed to bdb_parse being the fruit of a team at UTexas digitizing the public domain BDB printing.

Their project failed to get a grant renewal before the rough edges were polished. Specifically, many foreign languages (Aramaic, Arabic, Ethiopian, etc.) were left in an encoded state. The Hebrew itself seemed to have an issue with ordering and encoding het and tsade.

After just glancing through your final results, I can see you seem to have fixed all of the above issues while seeming to have removed details encoded into the original (i.e., the original language of a given word.) I can tell a lot of work occurred between the final results of your public repository Sefaria-BDB and the content displayed on the site.

I also noticed that these final results were excluded from the large dump at Sefaria-Export. Given how much effort went into its production, I am unsure if this is intentional. That being said, I figured I'd give it a shot and politely request that it be included in the export, if possible.

Thanks a ton,

Julian Wagle

Which English version of Genesis in Sefaria is the best?

I asked this on the Judaism StackExchange: https://judaism.stackexchange.com/questions/134282/what-is-the-best-free-version-of-the-book-of-genesis-in-english-which-is-comple

To summarize, I am just looking for Genesis in English, to begin my search for quality open source English translations of the Torah (and potentially Tanakh or larger books in the future, but Torah is pretty universal). However, it appears that not even the Genesis JSON in Sefaria is complete or accurate. Can you briefly outline the state of affairs when it comes to English translations in JSON (or another format), particularly for Genesis?

For Genesis in particular, what is the best English translation you have? Are you planning on adding more? If you don't have a quality one, do you know of a quality open source (free) one online somewhere, by chance?

Thank you for all you've done, what an invaluable resource Sefaria is so far!

Generating and publishing statistics e.g. size of various texts

I've looked in various places for a sense of how large each of the texts on Sefaria is, e.g. as listed on the main page, with no luck. Is there a page that summarizes that, or a tool to process the export data to calculate sizes in terms of word counts, pages, or whatever is common in this realm?

I guess this might end up on the Metrics page: https://www.sefaria.org/metrics

Thanks for this great project!
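
In the absence of a published summary, a rough word count can be computed from the export itself. The sketch below assumes the text content of a file is a nested structure of lists (and, for more complex works, dictionaries) whose leaves are strings; words are simply whitespace-separated tokens, so any embedded markup would be counted too. The function name is illustrative.

```python
def count_words(node) -> int:
    """Recursively count whitespace-separated words in a nested text structure."""
    if isinstance(node, str):
        return len(node.split())
    if isinstance(node, list):
        return sum(count_words(child) for child in node)
    if isinstance(node, dict):
        return sum(count_words(value) for value in node.values())
    return 0  # ignore numbers, None, and other non-text leaves
```

It should be applied only to the text portion of a loaded JSON file, so that metadata strings are not included in the total.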

Add Jastrow here

Hi there, I just wanted to request the Jastrow dictionary as txt or json in this repo.
Todah Rabbah.

Smaller data dump is corrupted

I have tried multiple times to download the data without text history and I cannot extract the tarball: unexpected EOF. Please verify that the file is intact.

Duplicates leading to large repository sizes

cltk-flat and cltk-full seem to duplicate a lot of the content from the json directory. Each one of these directories is 4.1GB, meaning that a git clone operation is extremely slow and requires a lot of disk. (Sparse clone is theoretically possible but very fiddly to set up and very slow to execute, and it has problems with the number of files in the schema directory.)

Would it be possible to do one of the following?

  1. Put the cltk* material in a separate git repository
  2. Have a helper script that re-builds the cltk* based on the information in the json directory if needed
  3. Have a helper script that downloads the cltk* from an FTP site if needed
  4. Have the cltk* file trees use symlinks to, rather than duplicating the files from, json
  5. Refactor the code not to need largely-redundant file trees

Jastrow dictionary

Is the Jastrow dictionary available among your structured data? I am not able to find it... Thank you very much in advance.

Siddur License

Hi! We are curious whether we can use your siddur commercially. What is the license?

database download link is not valid

Also, Dropbox might not be the greatest option. It used to be that I had to add the file to my Dropbox account before I could download it, and it was too big for the free account. I think a Google Drive or S3 link might be a better option.

Converting output to well known book formats

Hi

Has anyone created software that can convert one of Sefaria's formats (say, JSON) to an e-book format like EPUB, MOBI, etc.?
If not, is there any documentation of Sefaria's formats, so I can create one myself?

May you be blessed from Heaven.

Path collision: Case-sensitive files on a case-insensitive filesystem (APFS)

There are a handful of JSON files that have 2 versions–one with an uppercased letter and one without. Is this intentional?

The macOS APFS filesystem, while capable of case sensitivity, is almost always configured to be case-insensitive. The word "sefer" is a common source of collisions.

on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

  'cltk-flat/Halakhah/Sefer HaChinukh/Hebrew/Sefer HaChinukh -- Torat Emet.json'
  'cltk-flat/Halakhah/Sefer HaChinukh/Hebrew/sefer HaChinukh -- Torat Emet.json'
  'cltk-flat/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna Edition.json'
  'cltk-flat/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna edition.json'
  'cltk-flat/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim On Jonah--Wikisource.json'
  'cltk-flat/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim on Jonah--Wikisource.json'
  'cltk-full/Halakhah/Sefer HaChinukh/Hebrew/Sefer HaChinukh -- Torat Emet.json'
  'cltk-full/Halakhah/Sefer HaChinukh/Hebrew/sefer HaChinukh -- Torat Emet.json'
  'cltk-full/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna Edition.json'
  'cltk-full/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna edition.json'
  'cltk-full/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim On Jonah--Wikisource.json'
  'cltk-full/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim on Jonah--Wikisource.json'
  'json/Halakhah/Sefer HaChinukh/Hebrew/Sefer HaChinukh -- Torat Emet.json'
  'json/Halakhah/Sefer HaChinukh/Hebrew/sefer HaChinukh -- Torat Emet.json'
  'json/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna Edition.json'
  'json/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna edition.json'
  'json/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim On Jonah--Wikisource.json'
  'json/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim on Jonah--Wikisource.json'
  'txt/Halakhah/Sefer HaChinukh/Hebrew/Sefer HaChinukh -- Torat Emet.txt'
  'txt/Halakhah/Sefer HaChinukh/Hebrew/sefer HaChinukh -- Torat Emet.txt'
  'txt/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna Edition.txt'
  'txt/Talmud/Bavli/Commentary/Chidushei Halachot/Seder Moed/Chidushei Halachot on Taanit/Hebrew/Vilna edition.txt'
  'txt/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim On Jonah--Wikisource.txt'
  'txt/Tanakh/Commentary/Malbim/Prophets/Malbim on Jonah/Hebrew/Malbim on Jonah--Wikisource.txt'
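
Collisions like these can be detected before a clone ever fails by grouping paths on their case-folded form. This is a minimal sketch with a hypothetical function name; it takes any iterable of path strings, for instance the output of `git ls-files`.

```python
from collections import defaultdict


def case_collisions(paths):
    """Return groups of paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    # Keep only the groups where two or more distinct paths collide.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```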

Can't checkout repository on Windows

Hi,

Not sure if this error will happen on other OS, however on Windows, a checkout fails, since one of the folder names contains invalid characters for a Windows file/folder:

fatal: cannot create directory at 'cltk-flat/Modern Works/Works of Eliezer Berkovits/Conversion "According to Halakhah"; What Is It': Invalid argument
warning: Clone succeeded, but checkout failed.

Windows does not allow quotes in a file name.

Inconsistent escape characters

In the json directory, each line of the Mishnah (and of several other sources) ends with a \n escape character. However, in some places (e.g. Mishnah Tahorot/Hebrew/merged.json), some chapters are missing the escape character.
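
The affected chapters can be listed with a small check such as the one below. It is a sketch that assumes the text is a list of chapters, each a list of line strings, and it reports the 1-based numbers of chapters containing at least one line without the trailing newline. The function name is hypothetical.

```python
def chapters_missing_newline(chapters):
    """Return 1-based numbers of chapters where some line lacks a trailing newline."""
    return [
        number
        for number, lines in enumerate(chapters, start=1)
        if any(not line.endswith("\n") for line in lines)
    ]
```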

Difference between Sefaria app and Sefaria-Export

Hello,

I use the Sefaria app and I noticed that there are differences between the text there and the text in this export.

For example, in the app, Genesis 2:10 it says:

"A river issues from Eden to water the garden, and it then divides and becomes four branches."

But, from the merged.json [and merged.txt], it says:

"And a river went out of Eden to water the garden; and from thence it was parted, and became four heads."

Can you kindly explain why this is so? What can be done to normalize this?

Add manual update info to README

From e.g. #9 and #32, it's apparently not obvious from the README that this repository is manually and intermittently updated.

It would be nice if the README made that clearer, along with a "last update" date on the README itself and not just in the commit history.

Maybe this can be done on the next update - it looks like it's about time 😉

Texts not using appropriate Unicode punctuation

Some Sefaria texts use similar-shaped glyphs where more semantically appropriate glyphs exist in the Unicode standard:

e.g.
U+0022 QUOTATION MARK in place of U+05F4 HEBREW PUNCTUATION GERSHAYIM
U+0027 APOSTROPHE in place of U+05F3 HEBREW PUNCTUATION GERESH
U+003A COLON in place of U+05C3 HEBREW PUNCTUATION SOF PASUQ

see as an example https://raw.githubusercontent.com/Sefaria/Sefaria-Export/master/cltk-flat/Midrash/Aggadic%20Midrash/Midrash%20Tehillim/Hebrew/merged.json
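
A substitution pass along the following lines could apply the three replacements listed above. It is a heuristic sketch, not a vetted normalizer: it only rewrites a mark that directly follows a Hebrew letter (and, for the quotation mark, directly precedes one), so marks that follow vowel points or Latin text are left alone, and a colon after a Hebrew letter is always treated as sof pasuq, which may be wrong outside biblical text. The function name is hypothetical.

```python
import re

HEBREW_LETTER = r"[\u05D0-\u05EA]"  # alef through tav, including final forms


def normalize_hebrew_punctuation(text: str) -> str:
    """Replace ASCII look-alikes with Hebrew punctuation code points."""
    # " between two Hebrew letters -> U+05F4 GERSHAYIM
    text = re.sub(rf'(?<={HEBREW_LETTER})"(?={HEBREW_LETTER})', "\u05F4", text)
    # ' directly after a Hebrew letter -> U+05F3 GERESH
    text = re.sub(rf"(?<={HEBREW_LETTER})'", "\u05F3", text)
    # : directly after a Hebrew letter -> U+05C3 SOF PASUQ
    text = re.sub(rf"(?<={HEBREW_LETTER}):", "\u05C3", text)
    return text
```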
