
bayesian-beagle's Introduction

Bayesian beagle blog 🐶

Welcome to the Bayesian beagle blog! This project is a unique intersection of machine learning and scientific communication, providing a platform where readers can quickly get insights from the latest research papers hosted on ArXiv. Utilizing state-of-the-art Large Language Models (LLMs), our system generates concise, comprehensible summaries of complex research articles, covering a wide array of disciplines.

Our blog is built using Quarto, an open-source scientific and technical publishing system designed for creating beautiful, data-driven content. It is then published with Netlify.

graph LR
    A["Download daily Arxiv articles"] --> B["Predict and Filter LLM topic"]
    B --> C["Summarize short docs"]
    B --> D["Summarize by Map-Reduce long docs"]
    C --> E["Update website with summaries daily"]
    D --> E


Features

  • Curated ArXiv Articles: A handpicked selection of the most intriguing and high-impact research papers from various fields on ArXiv.
  • Automated Summaries: Each article is accompanied by a summary automatically generated by a sophisticated Large Language Model tailored for scientific content, utilizing ArXiv's new HTML (beta) formatting.
  • Regular Updates: Our collection is updated regularly via GitHub Actions to include new research findings and innovations.
  • LLM-research: Coverage focuses on LLM-related research.

How It Works

  1. Article Selection: We curate a list of ArXiv articles based on recency, impact, and relevance to a diverse audience.
  2. Summary Generation: LLMs are employed to read and understand the selected articles and provide a human-readable summary.
  3. Blog Publication: These summaries are formatted and published as blog posts on our Quarto-powered platform.
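
The pipeline diagram distinguishes short documents, which are summarized directly, from long documents, which are summarized by Map-Reduce. The following is a minimal sketch of that routing, not the project's actual implementation: `split_into_chunks`, `summarize_document`, the `max_chars` threshold, and the `summarize` callable (a stand-in for whatever LLM call is used) are all hypothetical names introduced for illustration.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Split text on whitespace into chunks of roughly max_chars characters.

    A single word longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_document(
    text: str,
    summarize: Callable[[str], str],  # assumption: wraps the LLM call
    max_chars: int = 2000,            # assumption: illustrative threshold
) -> str:
    """Summarize short text directly; map-reduce over chunks otherwise."""
    if len(text) <= max_chars:
        return summarize(text)
    # Map: summarize each chunk independently.
    partial_summaries = [
        summarize(chunk) for chunk in split_into_chunks(text, max_chars)
    ]
    # Reduce: summarize the concatenated partial summaries.
    return summarize("\n".join(partial_summaries))
```

One simplification worth noting: the reduce step runs once, so a very long document could still produce a combined text above the threshold. A fuller version might apply the reduce step recursively.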

Usage

The blog is live at https://bayesian-beagle.netlify.app/

Navigate to the blog using the provided link and enjoy the latest research summaries. If you're interested in how the blog is generated or want to suggest improvements, feel free to check the repository or open an issue.

Installation and Setup

To clone and run this project locally, you'll need Git, Quarto, and the necessary Python packages installed on your computer. From your command line:

# Clone this repository
git clone https://github.com/wesslen/bayesian-beagle.git

# Go into the repository
cd bayesian-beagle

# Create venv
python3.9 -m venv venv
source venv/bin/activate

# Install dependencies for summary
pip install -r requirements-summarizer.txt

# Install dependencies for build
pip install -r requirements-build.txt

# Install dependencies for langchain
pip install -r requirements-langchain.txt

# Curate arxiv ids in data/input.jsonl, ensure they have HTML renderings

# Run the summary generation script
python scripts/summarizer.py data/input.jsonl

# Generate the Quarto (.qmd) posts from the summaries
python scripts/generate_qmd.py data/output.jsonl posts

# Build the Quarto blog
quarto render

Contributing

We welcome contributions from the community. Here's how you can help:

  • Suggest Articles: Know some great ArXiv papers that deserve a summary? Let us know!
  • Enhance Summaries: Help us refine the machine-generated summaries for accuracy and clarity.
  • Improve Code: Contribute to the code that powers the blog and the summary generation process.
  • Design and UX: Assist us in creating a more engaging and user-friendly interface.

To contribute, please fork the repository and push your changes, then open a pull request.

License

Distributed under the MIT License. See LICENSE for more information.



bayesian-beagle's Issues

Add --force-generate-all flag; include it as a GH action

As a GH action:

on:
  workflow_dispatch:
    inputs:
      force_generate_all:
        description: 'Force generate all posts'
        required: false
        type: boolean
        default: false

      - name: Run script
        run: python generate_posts.py --force_generate_all ${{ github.event.inputs.force_generate_all }}
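
Because the workflow step interpolates the input into the shell command as the text `true` or `false`, the script has to turn that text into a boolean. Here is one possible sketch using `argparse`, following the `--force_generate_all` spelling from the step above. The helper names `str_to_bool` and `build_parser` are hypothetical, and the issue does not say how `generate_posts.py` parses its arguments.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert the text passed on the command line into a boolean."""
    normalized = value.strip().lower()
    if normalized in {"true", "1", "yes"}:
        return True
    if normalized in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the (hypothetical) generate_posts.py CLI."""
    parser = argparse.ArgumentParser(description="Generate blog posts.")
    parser.add_argument(
        "--force_generate_all",
        type=str_to_bool,
        default=False,
        help="Regenerate every post, even if its file already exists.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"force_generate_all={args.force_generate_all}")
```

With this shape the flag always needs a value. If the workflow also runs on triggers that do not supply the input, the expression expands to nothing, and the step would need a fallback value for the command to parse.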

posts/prompt_weight_experiments_for_llm_instruction_fine_tuning/2024-01-24-prompt_weight_experiments_for_llm_instruction_fine_tuning

Bayesian beagle - Prompt Weight Experiments for LLM Instruction Fine-Tuning

Study examines impact of prompt token classification loss weighting on LLaMA models fine-tuned on instruction tasks. Results vary based on dataset length.

https://bayesian-beagle.netlify.app/posts/prompt_weight_experiments_for_llm_instruction_fine_tuning/2024-01-24-prompt_weight_experiments_for_llm_instruction_fine_tuning

Transition from JSONL File to SQLite Database for Page Generation

Issue Description

Currently, our static website built with Quarto is generating pages using a .jsonl file, where each record corresponds to an Arxiv ID. We aim to replace this with a SQLite database to enhance scalability and maintainability.

Steps to Implement

1. Create a SQLite Database

  • Design Database Schema: Reflect the structure of the .jsonl file in the database schema.
  • Initialize Database: Create a SQLite database with tables based on the schema.

2. Data Migration

  • Migration Script: Develop a Python script to migrate data from the .jsonl file to the SQLite database.
  • Data Integrity: Ensure all data is accurately transferred.
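
As a starting point for the migration script, here is a minimal sketch using only the standard library. It rests on assumptions the issue does not confirm: each JSONL record is taken to carry its Arxiv ID under an `"id"` key, and the whole record is stored as JSON text in a hypothetical `articles` table rather than being split into columns.

```python
import json
import sqlite3
from pathlib import Path


def migrate_jsonl_to_sqlite(jsonl_path: Path, db_path: Path) -> int:
    """Copy each JSONL record into an `articles` table; return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS articles ("
                "arxiv_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
            )
            with jsonl_path.open(encoding="utf-8") as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    record = json.loads(line)
                    # Assumption: the Arxiv ID lives under the "id" key.
                    conn.execute(
                        "INSERT OR REPLACE INTO articles (arxiv_id, payload) "
                        "VALUES (?, ?)",
                        (record["id"], json.dumps(record)),
                    )
        return conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    finally:
        conn.close()
```

Comparing the returned row count with the number of distinct IDs in the source file gives a simple integrity check. Because of `INSERT OR REPLACE`, re-running the script does not duplicate rows.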

3. Update Python Scripts for Data Retrieval

  • Modify Existing Scripts: Adapt scripts that currently read the .jsonl file to instead interact with the SQLite database.
  • Implement Database Queries: Use Python libraries like sqlite3 to handle database operations.
  • Fetch Records by ID: Ensure the updated scripts can retrieve specific records using Arxiv IDs.
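
Retrieval by Arxiv ID could look like the sketch below. It assumes a hypothetical `articles` table with an `arxiv_id` primary key and a `payload` column holding the record as JSON text. The real schema is still to be designed in step 1, so the table and column names are placeholders.

```python
import json
import sqlite3
from typing import Optional


def fetch_record(conn: sqlite3.Connection, arxiv_id: str) -> Optional[dict]:
    """Return the stored record for arxiv_id, or None if it is not present."""
    row = conn.execute(
        # A parameterized query avoids building SQL through string formatting.
        "SELECT payload FROM articles WHERE arxiv_id = ?",
        (arxiv_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Returning `None` for a missing ID lets the page-generation script decide whether to skip the record or fail loudly.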

4. Testing

  • Test Data Retrieval: Thoroughly test the updated scripts for accurate database interactions.
  • Validate Page Generation: Compare newly generated pages with existing ones for consistency.

5. Documentation

  • Update Documentation: Reflect the transition from .jsonl to SQLite in the project documentation.
  • Setup Instructions: Include guidelines for database setup and script execution.

6. Deployment

  • Merge and Deploy: After testing, merge the changes into the main branch and update the live website.

Additional Considerations

  • Performance: Optimize query performance for large datasets.
  • Backup and Recovery: Implement a strategy for database backup.
  • CI/CD Pipeline: Update CI/CD processes to include the new database setup.

Request for Contributions

Contributions are welcome! If you're interested in helping with this transition, please comment on this issue or submit a pull request.

refactor `generate_qmd.py`

Here are the main functions of the code:

  1. convert_to_folder_name: This function takes a string as input and converts it to a folder name format by replacing spaces, slashes, question marks, colons, commas, and hyphens with underscores. It uses the translate method of strings with a translation table to perform the replacement.

  2. create_qmd_file: This function takes an example dictionary and an output folder path as input. It extracts relevant information from the example dictionary, such as the title, publish date, and image URL. It also creates a folder name based on the title using convert_to_folder_name. It then creates a QMD file path by combining the output folder path, folder name, and current date. It checks if the file already exists and returns if it does. Otherwise, it uses Jinja2 templating to render a QMD file content with the example data. It creates the output sub-folder if it doesn't exist and writes the rendered content to the file.

  3. generate_qmd: This is the main command of the script. It takes an input JSONL file path and an output folder path as input. It opens the JSONL file and iterates over each line. It loads the line as a JSON object and calls the create_qmd_file function to generate a QMD file for each example.

The code uses the typer package for command-line interface (CLI) parsing. It defines a CLI command generate_qmd that accepts the input JSONL file and output folder as arguments.
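
For reference, the behavior described for convert_to_folder_name can be sketched as follows. This is a reconstruction from the description above, not the script's actual source. The lowercasing step is an assumption inferred from the lowercase post paths on the live blog, since the description does not mention it.

```python
# Characters the description lists as being replaced with underscores.
_REPLACED_CHARS = " /?:,-"
_TRANSLATION_TABLE = str.maketrans({char: "_" for char in _REPLACED_CHARS})


def convert_to_folder_name(title: str) -> str:
    """Replace spaces, slashes, question marks, colons, commas, and hyphens
    with underscores using a translation table."""
    # Assumption: lowercasing is inferred from the published post paths.
    return title.lower().translate(_TRANSLATION_TABLE)
```

For example, the title "Prompt Weight Experiments for LLM Instruction Fine-Tuning" maps to `prompt_weight_experiments_for_llm_instruction_fine_tuning`, which matches the folder name of the post listed in the issues above.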

Suggestions for improving the code for programming best practices:

  1. Use type hints: Add type hints to function and variable declarations to improve code readability and maintainability.

  2. Use docstrings: Add docstrings to functions to provide a description of their purpose and usage.

  3. Handle exceptions: Add exception handling to catch and handle any potential errors that may occur during file operations, such as opening, reading, or writing files.

  4. Separate concerns: Consider breaking down the create_qmd_file function into smaller, more focused functions to improve code organization and readability.

  5. Use pathlib functions: Instead of manually constructing file paths using string concatenation, use the pathlib module's functions to manipulate file paths. For example, use Path(output_folder) / folder_name / file_name to create the file path.

  6. Use f-strings: Instead of concatenating strings using the + operator, use f-strings to improve string formatting and readability. For example, use f"File saved: {file_path}" instead of "File saved: " + file_path.

  7. Use a logger: Instead of printing messages directly to the console, consider using a logging library like logging to log messages with different levels of severity.

  8. Error handling for existing directories: Add error handling to check whether the output folder already exists and whether it is a valid directory before writing files to it.

  9. Unit tests: Add unit tests to verify the correctness of the functions and handle edge cases.
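
Several of these suggestions (type hints, docstrings, pathlib, f-strings, logging, and exception handling) can be combined in a small sketch. The function names `build_qmd_path` and `write_qmd` are hypothetical. The `<date>-<folder_name>.qmd` file naming is an assumption inferred from the published post paths, and the Jinja2 rendering step is left out.

```python
import logging
from datetime import date
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def build_qmd_path(
    output_folder: Path, folder_name: str, on: Optional[date] = None
) -> Path:
    """Return <output_folder>/<folder_name>/<YYYY-MM-DD>-<folder_name>.qmd."""
    day = on or date.today()
    return output_folder / folder_name / f"{day.isoformat()}-{folder_name}.qmd"


def write_qmd(path: Path, content: str) -> bool:
    """Write content to path unless it already exists.

    Returns True if a file was written and False if it was skipped.
    """
    if path.exists():
        logger.info("Skipping existing file: %s", path)
        return False
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
    except OSError:
        # Log with traceback, then let the caller decide how to proceed.
        logger.exception("Could not write %s", path)
        raise
    logger.info("File saved: %s", path)
    return True
```

Splitting path construction from file writing also makes both pieces easy to unit test with a temporary directory.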

Weird model output

For Mixtral, found a `tldr` with code:

"subtitle": "Fine-tuning embeddings improves item retrieval in conversational recommendation agents.\n```python\n\n```"

This prevented Quarto from rendering. Need to keep an eye out for it.
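
One possible guard is to strip code fences from the model output before it is written to the post's front matter. This is a minimal sketch under the assumption that stray triple-backtick fences are the only problem. `clean_subtitle` is a hypothetical helper, not an existing function in the repository.

```python
import re

# Matches a complete fenced block, including its language tag and body.
_FENCED_BLOCK = re.compile(r"```.*?```", re.DOTALL)


def clean_subtitle(text: str) -> str:
    """Remove fenced code blocks and stray fences, then collapse whitespace."""
    without_blocks = _FENCED_BLOCK.sub(" ", text)
    without_stray_fences = without_blocks.replace("```", " ")
    return " ".join(without_stray_fences.split())
```

Applied to the subtitle above, this keeps only the sentence and drops the empty Python block and the trailing newlines.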
