fuuka's People

Contributors

anounyym1, desuwa, eksopl, oohnoitz, voldemortgui

fuuka's Issues

New parsing error

Looks like one post causes this now

http://archive.rebeccablacktech.com/panic.txt

Error parsing post 24955541:
------
<div class="postContainer opContainer" id="pc24955541"><div id="p24955541" class="post op"><div class="postInfoM mobile" id="pim24955541"><span class="nameBlock"><span class="name"></span><br><span class="subject"></span></span><span class="dateTime postNum" data-utc=""><br><em><a href="24955541#p24955541" title="Highlight this post">No.</a><a href="javascript:quote('24955541');" title="Quote this post">24955541</a></em></span></div><div class="postInfo desktop" id="pi24955541"><input type="checkbox" name="24955541" value="delete"/> <span class="subject"></span> <span class="nameBlock"><span class="name"></span></span> <span class="dateTime" data-utc=""></span> <span class="postNum"><a href="24955541#p24955541" title="Highlight this post">No.</a><a href="javascript:quote('24955541');" title="Quote this post">24955541</a> &nbsp; </span> </div><blockquote class="postMessage" id="m24955541"></blockquote> </div>
------
 at Board.pm line 247 thread 14
        Board::troubles('Board::Yotsuba=HASH(0x7faa08029758)', 'Error parsing post 24955541:
------
<div class="postContainer...') called at Board/Yotsuba.pm line 194 thread 14
        Board::Yotsuba::parse_post('Board::Yotsuba=HASH(0x7faa08029758)', '<div class="postContainer opContainer" id="pc24955541"><div i...', 0) called at Board/Yotsuba.pm line 125 thread 14
        Board::Yotsuba::parse_thread('Board::Yotsuba=HASH(0x7faa08029758)', '<div class="postContainer opContainer" id="pc24955541"><div i...') called at Board/Yotsuba.pm line 313 thread 14
        Board::Yotsuba::get_thread('Board::Yotsuba=HASH(0x7faa08029758)', 24955541, undef) called at Board.pm line 123 thread 14
        Board::__ANON__() called at Board.pm line 129 thread 14
        Board::content('Board::Yotsuba=HASH(0x7faa08029758)', 'Board::Request::THREAD=ARRAY(0x7faa08016478)') called at ./board-dump.pl line 328 thread 14
        main::__ANON__() called at ./board-dump.pl line 377 thread 14
        eval {...} called at ./board-dump.pl line 377 thread 14
Error parsing thread (see failed post above)
------
 at Board.pm line 247 thread 14
        Board::troubles('Board::Yotsuba=HASH(0x7faa08029758)', 'Error parsing thread (see failed post above)\x{a}------\x{a}') called at Board/Yotsuba.pm line 127 thread 14
        Board::Yotsuba::parse_thread('Board::Yotsuba=HASH(0x7faa08029758)', '<div class="postContainer opContainer" id="pc24955541"><div i...') called at Board/Yotsuba.pm line 313 thread 14
        Board::Yotsuba::get_thread('Board::Yotsuba=HASH(0x7faa08029758)', 24955541, undef) called at Board.pm line 123 thread 14
        Board::__ANON__() called at Board.pm line 129 thread 14
        Board::content('Board::Yotsuba=HASH(0x7faa08029758)', 'Board::Request::THREAD=ARRAY(0x7faa08016478)') called at ./board-dump.pl line 328 thread 14
        main::__ANON__() called at ./board-dump.pl line 377 thread 14
        eval {...} called at ./board-dump.pl line 377 thread 14
[26 48 1 181 22 ] Couldn't insert posts into database: Must specify a thread number for this board

Investigate dumper crash in sub marks_delete

The dumpers often die inside the marks_delete function, usually at line 88 of board-dump.pl, with:

Thread 21 terminated abnormally: Invalid value for shared scalar at ./board-dump.pl line 88.

Most likely a race condition where a thread's info is deleted while another thread is in the middle of marking posts as deleted, but I'm having trouble reproducing it.
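If the race described above is the cause, the usual cure is to make "remove a thread" and "mark posts as deleted" take the same lock. A minimal sketch of that idea, in Python rather than the dumper's Perl; `ThreadRegistry` and its method names are hypothetical and are not taken from board-dump.pl:

```python
import threading


class ThreadRegistry:
    """Hypothetical stand-in for the dumper's shared per-thread state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._threads = {}  # thread number -> set of post numbers seen so far

    def add(self, thread_num, post_nums):
        with self._lock:
            self._threads[thread_num] = set(post_nums)

    def remove(self, thread_num):
        # Takes the same lock as mark_deleted, so the two cannot interleave.
        with self._lock:
            self._threads.pop(thread_num, None)

    def mark_deleted(self, thread_num, seen_post_nums):
        """Return the posts that vanished, or [] if the thread is already gone."""
        with self._lock:
            known = self._threads.get(thread_num)
            if known is None:
                return []  # thread info was removed first; nothing to mark
            gone = sorted(known - set(seen_post_nums))
            known.difference_update(gone)
            return gone
```

The point is only that the lookup and the mutation happen under one lock, and that a missing thread is treated as a normal outcome instead of an error.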


(Imported from issue 24 @ Google Code)

Posts not displaying

There's not much to say about it: the entries are in the database and the images are in the Board folder, yet nothing displays on the page with the most recent code from GitHub. The configuration is identical to what it was before I updated.

It puts me at a standstill as the previous version's board-dumper keeps dying.

Fix slow queries for thread listing

Unfortunately, the query change by FoOlRulez/FoOlFuuka for ghost mode listing that allowed me to drop the horrible hack that the *_local boards were is too heavily dependent on query cache, and it doesn't really scale. Matters are made worse by utf8mb4, which apparently trashes the cache a little bit more? (perhaps using BLOB fields is a better approach, considering we don't really need MySQL to know about what kind of data it's storing for us, as long as one is using Sphinx).

Instead of just delaying the inevitable, fix this issue by introducing a table that lists threads, along with useful data for them, which will allow simplifying the queries for thread listing by quite a lot.
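A rough sketch of what such a threads table could look like, using Python's sqlite3 module only to keep the example runnable. The table and column names (`posts`, `threads`, `time_last`, `nreplies`) are assumptions for illustration, not the schema fuuka ended up with:

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE posts (
        num INTEGER PRIMARY KEY,
        parent INTEGER NOT NULL,      -- 0 for an opening post
        timestamp INTEGER NOT NULL
    );
    CREATE TABLE threads (
        parent INTEGER PRIMARY KEY,   -- number of the opening post
        time_last INTEGER NOT NULL,   -- timestamp of the newest post
        nreplies INTEGER NOT NULL DEFAULT 0
    );
    CREATE INDEX threads_time_last ON threads (time_last);
    """)
    return conn


def insert_post(conn, num, parent, timestamp):
    conn.execute("INSERT INTO posts VALUES (?, ?, ?)", (num, parent, timestamp))
    if parent == 0:
        conn.execute("INSERT INTO threads VALUES (?, ?, 0)", (num, timestamp))
    else:
        conn.execute(
            "UPDATE threads SET time_last = MAX(time_last, ?), "
            "nreplies = nreplies + 1 WHERE parent = ?",
            (timestamp, parent),
        )


def list_threads(conn, limit):
    # The listing only touches the small threads table and its index.
    return conn.execute(
        "SELECT parent, nreplies FROM threads ORDER BY time_last DESC LIMIT ?",
        (limit,),
    ).fetchall()
```

With this layout the thread listing no longer needs to group or sort the full posts table, which is the part that leaned on the query cache.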

Refine backlink preview function

Backlink preview on hover has been implemented with the help of a patch from @desuwa, but I'd like to still:

  • Make it an optional feature, either selectable by the admin in the config file (easier) or by the user, through cookies. I'm not sure which one to do yet.
  • Add forwardlinks, like the ones in FoOlFuuka (and in 4chan X).
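Both items need an index of which posts quote which. A small sketch of building that index from comment text, assuming quotes look like `>>12345` and that cross-board links such as `>>>/g/123` should be ignored; the function name and the input shape are made up for this example:

```python
import re
from collections import defaultdict

# ">>" followed by digits, but not when preceded by another ">" (cross-board link).
QUOTE_RE = re.compile(r"(?<!>)>>(\d+)")


def build_links(posts):
    """posts maps post number -> comment text. Returns (quotes, quoted_by)."""
    quotes = {}
    quoted_by = defaultdict(list)
    for num in sorted(posts):
        targets = []
        for match in QUOTE_RE.finditer(posts[num]):
            target = int(match.group(1))
            # Keep only quotes of posts we actually have, once each.
            if target in posts and target not in targets:
                targets.append(target)
        quotes[num] = targets
        for target in targets:
            quoted_by[target].append(num)
    return quotes, dict(quoted_by)
```

`quoted_by` is what a backlink preview would read; `quotes` is the same relation in the other direction.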


(Imported from issue 72 @ Google Code)

More parse errors

More of the same "two thread posts in one thread" errors, barely 5 minutes after starting the dumpers back up. I'm not sure what additional info to include.

http://nyafuu.org/vohtemp/panic-1352012.txt

EDIT: panic.txt has grown to 12 MB in about 2 1/2 hours. I'm not sure if it's happening more frequently or if it's just because it's from three boards instead of one.

Dumper dies on Windows at getgrnam

After updating past that layout change (whatever it was), board-dump.pl started giving me this:
"Thread 13 terminated abnormally: The getgrnam function is unimplemented at Board/Local.pm line 28."

I just updated my Perl modules to make sure, and I only edited the shebang in board-dump.pl; I didn't touch Local.pm.
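The general shape of a fix is to treat the group lookup as optional on platforms that don't have it. A sketch of that guard in Python, where the Unix-only `grp` module plays the role of Perl's getgrnam; `resolve_gid` is a hypothetical helper, and this says nothing about what Board/Local.pm actually does at line 28:

```python
try:
    import grp  # Unix-only module; importing it fails on Windows
except ImportError:
    grp = None


def resolve_gid(group_name):
    """Return the numeric gid for group_name, or None if it can't be resolved."""
    if grp is None or not group_name:
        return None  # no group database here, or no group configured
    try:
        return grp.getgrnam(group_name).gr_gid
    except KeyError:
        return None  # the group does not exist on this machine
```

A caller would then skip the ownership change whenever `resolve_gid` returns None instead of dying.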

Built-in botnet support

With different archivers popping up every once in a while, it'd be nice if the ones that are meant to be public could ping back to some central place that would list active archivers and their links.

ENABLE_BOTNET in board-config would default to false, of course.

Some newhtml dumper error

After the newhtml update I started getting this error in the dumper:

Error parsing post 24506058:
------
<div class="postContainer opContainer" id="pc24506058"><div id="p24506058" class="post op"><div class="postInfoM mobile" id="pim24506058"><span class="postNum nameBlock"><span class="subject">Attention 4chan extension/script/archive developers!</span> <span class="name"><span style="color:#F00000">moot</span></span> <span class="postertrip"><span style="color:#FF0000;font-weight:normal">!Ep8pui8Vw2</span></span> <span class="commentpostername"><span style="color:#F00000">## Admin</span></span><br /><em><a href="res/24506058#p24506058" title="Highlight this post">No.</a><a href="res/24506058#q24506058" title="Quote this post">24506058</a></em></span><span class="dateTime" data-utc="1335585262">04/27/12(Fri)23:54</span></div><div class="file" id="f24506058"><div class="fileInfo"><span class="fileText">File: <a href="//images.4chan.org/g/src/1335585262638.jpg" target="_blank">1335585262.jpg</a>-(18 KB, 476x356, <span title="bertstare.jpg">bertstare.jpg</span>)</span></div><a class="fileThumb" href="//images.4chan.org/g/src/1335585262638.jpg" target="_blank"><img src="//0.thumbs.4chan.org/g/thumb/1335585262638s.jpg" alt="18 KB" data-md5="9y3GCEbhpKTcHI8UQXXU+A==" style="height: 188px; width: 251px;" /></a></div><div class="postInfo" id="pi24506058"><input type="checkbox" name="24506058" value="delete" /> <span class="subject">Attention 4chan extension/script/archive developers!</span> <span class="nameBlock"><span class="name"><span style="color:#F00000">moot</span></span> <span class="postertrip"><span style="color:#FF0000;font-weight:normal">!Ep8pui8Vw2</span></span> <span class="commentpostername"><span style="color:#F00000">## Admin</span></span></span> <span class="dateTime" data-utc="1335585262">04/27/12(Fri)23:54</span> <span class="postNum"><a href="res/24506058#p24506058" title="Highlight this post">No.</a><a href="res/24506058#q24506058" title="Quote this post">24506058</a> <img src="//static.4chan.org/image/sticky.gif" alt="Sticky" title="Sticky" /> <img 
src="//static.4chan.org/image/closed.gif" alt="Closed" title="Closed" /> &nbsp; [<a href="res/24506058" class="replylink">Reply</a>]</span> </div><blockquote class="postMessage" id="m24506058"><div style="padding: 5px;margin-left: .5em;border-color: #faa;border: 2px dashed rgba(255,0,0,.1);border-radius: 2px">Soon we'll roll out an HTML rewrite across all of the imageboards. The design will remain the same&#44; but the underlying HTML/CSS has been rewritten from scratch. It is HTML5/CSS3&#44; and validates with the exception of a few CSS hacks for cross-browser compatibility.<br /><br />We've made these changes with you in mind. Our existing HTML is about /ten years old/&#44; and is a hodgepodge of tables and spans. The new HTML should be much easier to parse&#44; and when benchmarking the official 4chan Chrome extension&#44; we found that it parses approximately 600% faster.<br /><br />Please visit <a href="/htmlnew/" class="quotelink">&gt;&gt;&gt;/htmlnew/</a> to see the changes. We've tried to include every test case for things you'll see in production. Read through the posts to see some of the notes we've made pointing out specific changes.<br /><br />In addition&#44; CORS is now supported on www.4chan.org and sys.4chan.org&#44; with an origin of boards.4chan.org (HTTP/HTTPS supported). And the new code is a responsive design for mobile browsers.<br /><br />The new code will probably be rolled out some time this weekend. If you maintain an extension&#44; userscript&#44; or archiver&#44; please make your updates as soon as possible.<br /><br />Feel free to send feedback/questions to [email protected].</div></blockquote> </div><div class="postLink mobile"><span class="info"></span>
------
 at Board.pm line 247 thread 11
        Board::troubles('Board::Yotsuba=HASH(0x7f33a40295e8)', 'Error parsing post 24506058:
------
<div class="postContainer...') called at Board/Yotsuba.pm line 189 thread 11
        Board::Yotsuba::parse_post('Board::Yotsuba=HASH(0x7f33a40295e8)', '<div class="postContainer opContainer" id="pc24506058"><div i...', 0) called at Board/Yotsuba.pm line 125 thread 11
        Board::Yotsuba::parse_thread('Board::Yotsuba=HASH(0x7f33a40295e8)', '<div class="postContainer opContainer" id="pc24506058"><div i...') called at Board/Yotsuba.pm line 340 thread 11
        Board::Yotsuba::get_page('Board::Yotsuba=HASH(0x7f33a40295e8)', 0, 'Sun, 13 May 2012 20:42:45 GMT') called at Board.pm line 124 thread 11
        Board::__ANON__() called at Board.pm line 129 thread 11
        Board::content('Board::Yotsuba=HASH(0x7f33a40295e8)', 'Board::Request::PAGE=ARRAY(0x7f33a461e798)') called at ./board-dump.pl line 216 thread 11
        main::__ANON__() called at ./board-dump.pl line 302 thread 11
        eval {...} called at ./board-dump.pl line 302 thread 11
Error parsing thread (see failed post above)
------
 at Board.pm line 247 thread 11
        Board::troubles('Board::Yotsuba=HASH(0x7f33a40295e8)', 'Error parsing thread (see failed post above)\x{a}------\x{a}') called at Board/Yotsuba.pm line 127 thread 11
        Board::Yotsuba::parse_thread('Board::Yotsuba=HASH(0x7f33a40295e8)', '<div class="postContainer opContainer" id="pc24506058"><div i...') called at Board/Yotsuba.pm line 340 thread 11
        Board::Yotsuba::get_page('Board::Yotsuba=HASH(0x7f33a40295e8)', 0, 'Sun, 13 May 2012 20:42:45 GMT') called at Board.pm line 124 thread 11
        Board::__ANON__() called at Board.pm line 129 thread 11
        Board::content('Board::Yotsuba=HASH(0x7f33a40295e8)', 'Board::Request::PAGE=ARRAY(0x7f33a461e798)') called at ./board-dump.pl line 216 thread 11
        main::__ANON__() called at ./board-dump.pl line 302 thread 11
        eval {...} called at ./board-dump.pl line 302 thread 11

Otherwise it looks to be working.

Make some of the reports update in real time

Some of the reports run prohibitively expensive SQL queries. This can and should be mitigated by having triggers and auxiliary tables that track said data on the fly.

Supporting better DB engines in the future will also depend on this, since count(*)s tend to be very expensive in good DBs. InnoDB on MySQL and PostgreSQL behave that way, for example.
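To make the trigger idea concrete, here is a sketch of a per-name post counter kept up to date by triggers. It runs on SQLite through Python's sqlite3 module purely so it can be executed as-is; the table names (`posts`, `post_counts`) are assumptions, and the MySQL trigger syntax would differ:

```python
import sqlite3


def make_report_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE posts (num INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE post_counts (name TEXT PRIMARY KEY, postcount INTEGER NOT NULL);

    CREATE TRIGGER posts_after_insert AFTER INSERT ON posts
    BEGIN
        INSERT OR IGNORE INTO post_counts (name, postcount) VALUES (NEW.name, 0);
        UPDATE post_counts SET postcount = postcount + 1 WHERE name = NEW.name;
    END;

    CREATE TRIGGER posts_after_delete AFTER DELETE ON posts
    BEGIN
        UPDATE post_counts SET postcount = postcount - 1 WHERE name = OLD.name;
    END;
    """)
    return db


def post_count_report(db):
    # Reads the small auxiliary table instead of running COUNT(*) over posts.
    return db.execute(
        "SELECT name, postcount FROM post_counts ORDER BY postcount DESC, name"
    ).fetchall()
```

The report then costs one read of the auxiliary table, no matter how large the posts table gets.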

Support for 2ch-style poster IDs

Given that moot enabled forced Anonymous and 2ch-style IDs for /b/, the dumper should support them.

Sadly, this means yet another column to store the poster ID.

Prevent archival of banned images

Prevent banned images from ever being fetched by the dumper. Obvious levels of "bannedness" are:

  1. Don't archive the full image. Get the thumb only.
  2. Won't fetch thumb, won't fetch full image.
  3. Won't fetch thumb or image, will automatically call the cops through an Asterisk PBX and relay a message generated through text-to-speech synth.
  4. Won't fetch thumb or image, get the poster's IP through 4chan's private API (needs a key from moot), email the poster details to Anonymous Jones, who will get the Mossad working on it.
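Levels 1 and 2 boil down to a lookup before each fetch. A minimal sketch, assuming bans are keyed by the image hash; the `BanLevel` names, the `fetch_plan` helper and the sample hashes are all hypothetical:

```python
from enum import IntEnum


class BanLevel(IntEnum):
    NONE = 0        # fetch thumb and full image
    THUMB_ONLY = 1  # level 1: get the thumb only
    NOTHING = 2     # level 2: fetch neither


def fetch_plan(media_hash, banned):
    """banned maps image hash -> BanLevel. Returns what the dumper may fetch."""
    level = banned.get(media_hash, BanLevel.NONE)
    return {
        "thumb": level < BanLevel.NOTHING,
        "full": level < BanLevel.THUMB_ONLY,
    }
```

Checking before the request is made is what guarantees a banned image is never downloaded in the first place.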


(Imported from issue 35 @ Google Code)

Fix manual report updating

Manual report updating doesn't actually work, even when accessing through localhost. May or may not be related to file permissions.


(Imported from issue 82 @ Google Code)

Check out why the dumper is so memory hungry

The dumper is slightly too memory hungry. When I first looked at the dumper code, the obvious reason seemed to be that the dumper keeps all 150 live threads, with all of their posts, in memory. But a first attempt at writing code to purge those seems to shave off no more than 10% of memory (I was able to drop memory usage for a dumper that wasn't archiving media or previews from 161 to 147 MiB, which checks out at ~100 KiB per thread). This suggests shared data structures aren't the issue here.
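The arithmetic behind "checks out" can be spelled out using only the figures quoted above:

```python
KIB_PER_MIB = 1024

live_threads = 150
before_mib = 161
after_mib = 147

saved_mib = before_mib - after_mib                        # 14 MiB freed by purging
per_thread_kib = saved_mib * KIB_PER_MIB / live_threads   # roughly 96 KiB per thread
saved_fraction = saved_mib / before_mib                   # a bit under 9% of the total
```

So the thread data accounts for under a tenth of the footprint, which is why the remaining memory has to come from somewhere else.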

Since Perl threads aren't lightweight by design, the other possible reason might be that there are just too many modules and other constant data structures that are loaded with each thread, and this issue might be unfixable without rewriting the dumper in another language.

Post reporting

Add the ability to let users report posts. This does depend on some kind of admin interface, though, and there is currently none.


(Imported from issue 73 @ Google Code)

Make user reports show latest name for trippers

As can be seen on https://archive.installgentoo.net/g/reports/post-count - I have some mojibake next to my trip there.
Now, the characters getting messed up probably isn't fuuka's fault; however, I haven't had that smiley thing in my name since 2009, I think.

So, the issue here is that post-count used to show the last name used with a trip, but now shows the first name since the archiving began, which can't be intentional.
Aaron tells me it's a fuuka issue and wanted me to report it so yeah.
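For reference, one way to get "latest name per trip" in a single query is to join the per-trip aggregate back to the newest post. This sketch assumes a `posts` table with `num`, `name` and `trip` columns and that a higher `num` means a later post; it is not the query the report currently uses:

```python
import sqlite3


def latest_names(conn):
    """Return (trip, latest name, post count) for every tripcode."""
    return conn.execute("""
        SELECT p.trip, p.name, c.postcount
        FROM (
            SELECT trip, COUNT(*) AS postcount, MAX(num) AS last_num
            FROM posts
            WHERE trip IS NOT NULL
            GROUP BY trip
        ) AS c
        JOIN posts AS p ON p.num = c.last_num
        ORDER BY c.postcount DESC, p.trip
    """).fetchall()
```

Swapping `MAX(num)` for `MIN(num)` gives the behaviour described in the report, where the first name ever seen is shown.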

Strange parse errors in dumper

Fairly frequently, the dumper will start outputting what appears to be the entire page, or multiple pages, of the board.

http://pastebin.com/b43MtGuY
- last section of the output, where the uninitialized values are

http://nyafuu.org/temp/dumper-error.txt
- FULL log: everything, all the crap it throws out. Pastebin wouldn't let me paste it, so here's a .txt if you care to have it (warning: ~4000 lines).

http://pastebin.com/SGxD2s4H
This happened this time before the mass output; I'm not sure if it's related. The first line might just be from my scrolling up in PuTTY, which is what I used to catch the output.

Don't store duplicate images

A quick check for /a/ shows that 66% of all images are reposts. It seems that one can considerably lower the disk space needed for thumbnail and full image archival by storing each unique image only once.

Requires dumper changes and possibly a rather heavy migration script, but it's feasible.
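The core of the dumper change would be naming files after their content hash so a repost maps to a file that already exists. A sketch under the assumption that MD5 is the key (4chan already exposes an MD5 per image in `data-md5`); the directory layout and the `store_image` helper are made up for illustration:

```python
import hashlib
import os


def store_image(root, data):
    """Write data under a path derived from its hash. Returns (path, was_new)."""
    digest = hashlib.md5(data).hexdigest()
    # Two levels of subdirectories keep any single directory from getting huge.
    directory = os.path.join(root, digest[:2], digest[2:4])
    path = os.path.join(directory, digest)
    if os.path.exists(path):
        return path, False  # repost: reuse the stored copy
    os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(data)
    return path, True
```

Posts would then reference the stored file by hash (or by an id in an images table), which is also what the migration script would have to backfill.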

Support If-Modified-Since headers in dumper

Implementing this seems a bit pointless, since it requires storing even more state (the dumpers already consume enough RAM as-is) and it doesn't strike me as something that brings any kind of bandwidth saving in normal usage.

The behavior of fuuka's dumper is to hit on index pages, which, on 4chan, are almost always likely to have changed. It only refreshes specific threads that it's tracking with a very long delay, usually something like 15 minutes. But while there's no bandwidth saving to be had in the former case, I suppose there might be some savings in the latter.
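For the thread-refresh case, a conditional GET looks roughly like the sketch below. It is written with Python's urllib, not the dumper's Perl HTTP code; `fetch_if_changed` and its `opener` parameter are hypothetical, and the only state needed per thread is the last `Last-Modified` value seen:

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None):
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, last_modified=None, opener=urllib.request.urlopen):
    """Return (body, last_modified); body is None when the server answers 304."""
    try:
        with opener(build_conditional_request(url, last_modified)) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as error:
        if error.code == 304:
            return None, last_modified  # unchanged: nothing to re-parse
        raise
```

A 304 response carries no body, so the saving is the full thread page for every refresh where nothing new was posted.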

Nevertheless, this might be something worth looking into after the actually serious issues with the dumper are fixed (I am looking at you, issue #2), even if just for the sake of completion.

Thread/reply title search

On some larger boards it might be useful to search the title field, for threads that use it instead of the comment field. I got it to work for exact values, but I don't have the knowledge to get past that.
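The simplest step past exact matching is a substring match with LIKE, as long as the wildcard characters in the user's input are escaped. A sketch against SQLite, assuming a `posts` table with a `title` column; on a large board this would scan the table, so a fulltext index or Sphinx is still the better long-term answer:

```python
import sqlite3


def escape_like(term, escape="\\"):
    """Escape LIKE wildcards so the term is matched literally."""
    return (
        term.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def search_titles(conn, term):
    pattern = "%" + escape_like(term) + "%"
    return conn.execute(
        "SELECT num, title FROM posts WHERE title LIKE ? ESCAPE '\\' ORDER BY num",
        (pattern,),
    ).fetchall()
```

Passing the pattern as a bound parameter keeps the user's search term out of the SQL text itself.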

Sane images path

Yeah, you know what I mean.

rebeccablacktech archives images for about a day, and I'd like 4chan X to be able to redirect dead images to them, or other potential future image archivers using fuuka.

Extra bits of information that aren't being saved

Just writing down the bits of information we're not storing.

  • Poster IDs (tracked in GH-16)
  • EXIF info (/p/)
  • Locked thread info
  • Timestamp of thread expiration
  • Deleted media info (currently treated as posts without media)

Feel free to add more stuff here if you can think of anything.

DB schema tracking bug

Tagging @woxxy, @oohnoitz.

I believe this is what we decided to support, at least for now.

  • What indexes can I drop for MySQL+Sphinx and MySQL+FT? There's a lot of unnecessary indexes there for MySQL+Sphinx, but I don't know what you guys need on your end. Same for MySQL+FT, but that one actually uses most of the indexes. I want to separate indexes that are only needed for MySQL+FT from the ones that are needed by both.

  • I believe we agreed that we would drop support for MyISAM on the main table. Fuuka, Asagi and FoolFuuka all need changes in order to support the MySQL InnoDB+MyISAM FT search scheme, for non-Sphinx environments. You guys good with that?

    MySQL+FT schema. This table would not be created for MySQL+Sphinx environments. I'm guessing:
    a_search: doc_id | title | comment

    Email, username and tripcode can all be handled by MySQL exact matches with regular MySQL btree indexes. I'm offering filename search on Fuuka, and I can't decide if I should go for exact matches or fulltext matches on that. If the latter, a_search on MySQL+FT would need to also have the media field there. What do you guys think?

  • Field names for thumb / images are driving me up the wall. These names are all wrong, they've always been wrong and I keep making mistakes in the code because the names are so incoherent with each other. I'm not entirely sold on media / preview / orig_filename. With the image deduplication scheme, preview should actually be preview_orig and orig_filename kinda makes sense as media_orig. (media should match preview. The fact that media_filename is the field that actually matches with preview and media is the media filename has always been a great source of error for me). Then on the images table, media_filename should be media and preview_op and preview_reply are just dandy. I'm sorry for keeping on changing the names on you guys, but this really is the last one that makes sense, don't you think?

    media -> name of full image on 4chan
    preview -> name of thumb on 4chan
    preview_op -> name of OP thumb on 4chan that got saved locally
    preview_reply -> name of reply thumb on 4chan that got saved locally
    media_filename -> filename of image in the hard drive of the user for that post (the filename that shows up in the post, ie: the ONE FIELD that is deserving of having filename in its name)

    So in the main table, we'd have: media_orig | preview_orig | media_filename
    And in the _images table: media_id | media_hash | media | preview_op | preview_reply

    This makes a SHITLOAD more sense.

    I know this is confusing because I'm switching the names of media with media_filename, but the thing is, THOSE NAMES HAVE ALWAYS BEEN SWAPPED, and I keep getting so fucking confused because of it. I really, really want to settle this once and for all, and since we're making DB changes, hey, perfect moment.

    Do you guys want to kill me for this? You can link to this page the next time you have to go down for DB changes and a thread inevitably pops up on /jp/. Add that "Eksopl is the one pushing changes after changes of time consuming database because he wants to give the impression that foolz is always" to fuel the conspiracies, if you want.

  • What name did we settle on for 2ch-like IDs? poster_hash? I believe we were avoiding _id so it doesn't get mixed up with actual ids we use in the database.

  • On that note, we were going to rename the id field to something, right? What was it? poster_ip?

  • We agreed that, at least for now, we're going to make it so the auxiliary tables can be fully recreated from the main table, so fields like height/width that could be moved to the _images table aren't going to be moved at this time (in case of a botched up migration, an archive owner can just drop aux tables and run scripts to recreate them). We're adding poster_hash, I can't say I care about locked, because it's only ever paired with stickies, EXIF can wait because no one is archiving /p/ and deleted media info, while interesting, can also wait until a later date, when not so much is happening at once. Timestamp of thread expiration sounds neat (I lifted that idea after looking at the schema of nih). However, I don't want to litter the main table with a field that's only going to be used when parent = 0. Should we wait on that until a later date so we can put it on the _threads table?

  • You can pull request the migration script you guys made once you're ready to roll. As I posted in roadmap, I won't merge until htmlnew is rolled out, though. We also need scripts to drop indexes on Sphinx and create the extra table on MyISAM FT, but I can do those.
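To make the field renaming from the naming bullet above easier to picture, here is the proposed split written out as table definitions. Only the column names listed in that bullet come from the proposal; the column types, the `doc_id` key on the main table and the `media_id` link from the main table to the images table are assumptions added so the sketch runs (SQLite through Python's sqlite3):

```python
import sqlite3

SCHEMA = """
CREATE TABLE a (
    doc_id INTEGER PRIMARY KEY,      -- assumed key, not part of the proposal
    media_id INTEGER,                -- assumed link to a_images
    media_orig TEXT,                 -- name of full image on 4chan
    preview_orig TEXT,               -- name of thumb on 4chan
    media_filename TEXT              -- filename on the poster's hard drive
);
CREATE TABLE a_images (
    media_id INTEGER PRIMARY KEY,
    media_hash TEXT NOT NULL UNIQUE,
    media TEXT,                      -- full image as saved locally
    preview_op TEXT,                 -- OP thumb as saved locally
    preview_reply TEXT               -- reply thumb as saved locally
);
"""


def column_names(conn, table):
    """List the column names of a table, in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
```

Seen side by side, every `_orig` field describes a name on 4chan, every field in the images table describes a local file, and `media_filename` is the only one that holds a user-supplied filename.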

Anything else?

Allow user to filter images

Let the user filter out images they find objectionable through a simple JS function that makes a list with the hashes of said images, stores it through cookies and shows another image that enables said user to acknowledge how much of a wuss they're being.


(Imported from issue 20 @ Google Code)
