jina-ai / gsoc

Google Summer of Code

gsoc's Introduction

Google Summer of Code 2023

Jina AI has been selected as one of 19 organizations in Infrastructure and Cloud for Google Summer of Code 2023! GSoC is an open source internship program offering paid remote work.

Almost anyone in the world over 18 years of age who loves coding and wants to explore the incredible world of open source can join us as a GSoC 2023 contributor.

🎥 In our GSoC x Jina AI webinar, our mentors presented their projects in depth and answered questions about the project requirements. Find the recording here, and the slides here.

This page contains a list of potential project ideas that we want to develop during GSoC 2023. If you would like to apply as a GSoC contributor, please follow these steps to get started:

Read through this page and the Google Summer of Code guides.
Identify a project idea, or come up with your own, on Issues.
Fill out the survey so that we can get to know you better.
Introduce yourself in the #introductions channel in our Slack community.
Join the #gsoc-support channel to communicate publicly with potential mentors.
We have a proposal template to help ensure you include all information we expect. All applications must be submitted through Google's application system from March 20 to April 4.

For details and rules of GSoC, please read the GSoC Manual, Timeline, and GSoC FAQs.

🔎 Who are we?

Welcome to the GSoC projects page of Jina AI!

Jina AI provides a platform for building neural search and generative AI services with cloud-native technology, and we are thrilled to participate in Google Summer of Code (GSoC) this year. Our goal is to provide students with opportunities to work on real-world, cutting-edge projects and contribute to the growth of our community.

We are firm supporters of open source and have open sourced multiple projects, including:

Jina is an MLOps framework that empowers anyone to build multimodal AI services via cloud-native technologies. It uplifts a local PoC into a production-ready service. Jina handles the infrastructure complexity, making advanced solution engineering and cloud-native technologies accessible to every developer. Check out the documentation!
DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API. DocArray is now hosted under the Linux Foundation AI & Data. Check out the documentation and roadmap!

💡 Project ideas

Title	Skills needed	Mentors	Difficulty	Size
Build Executor (model) UI in jina	Python	@Alaeddine Abdessalem, @Philip Vollet	Easy	175 hours, Medium
DocArray wrap ANN libraries	Python, ANN Search experience	@Johannes Messner, @Sami Jaghouar, @Philip Vollet	Medium	175 hours, Medium
Research about deploying LLM with Jina	Python, PyTorch, CUDA, Docker, Kubernetes	@Alaeddine Abdessalem, @Joan Martínez	Medium	350 hours, Large
Expand ANNLite capabilities with BM25 to build Hybrid Search	Python, C++, Lucene, ANN, Inverted Index	@Felix Wang, @Joan Martínez, @Girish Chandrashekar	Hard	350 hours, Large
Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature	ANN, C++, Python, Databases	@Felix Wang, @Joan Martínez	Hard	350 hours, Large
JAX support in DocArray v2	Python, AI/ML, JAX Framework experience	@Sami Jaghouar	Hard	175 hours, Medium

🧬 Project idea X: Your own idea!

If you have any ideas of your own, please feel free to use Issues to draft your project ideas, ask questions, and collaborate. Project ideas need to be approved by the instructor before they can be formally accepted.

Project idea template

1. Title
2. Summary: Short Project Description
3. Expected outcomes
4. Desired skills
5. Details
	- Skill needed
	- Project size
	- Difficulty level
	- Mentor: Email address
	- Suggested by: Person who suggested the idea

🤙 Contact

GitHub: Please use Issues to comment on project ideas, ask questions and collaborate.
Slack: We have our own Slack Community for communication.

🎓 What you will get

As a contributor in Jina AI's GSoC program, you will have the opportunity to:

Gain hands-on experience with real-world projects and technologies in the field of search and AI.
Work with a mentor from Jina AI who will support and guide you throughout the program.
Upon successful project completion, get invited as a speaker at our community events.
Receive regular feedback and help with your work, including blog posts, with quick responses from us.
Develop your technical skills, including software development, machine learning, and open-source contributions.
Build your professional network and make connections with other students, developers, and experts in the field.
Receive a stipend from Google for your participation in the program.

📝 How to improve your chances of being accepted?

The best way to increase your chances of being accepted as a Jina AI GSoC student is to start contributing now. Read up on Jina's contribution documentation and make yourself known to the other contributors through your contributions (preferably related to your proposal area). This way, when the time comes to evaluate student applications, you will be a recognized individual and more likely to receive the attention you need to develop a successful proposal.

We are looking for candidates who can demonstrate that they can work independently on a project. We are here to help, but we cannot monitor your progress every step of the way. Therefore, it is important to show us your motivation. Being active before the submission process is the best way to demonstrate this.

gsoc's Issues

Potential GSoC Project - Multimodal Real-Time Personality Recognition System

Hello,

My name is Euhid Aman. I am a final-year undergraduate student at Presidency University, Bangalore, India. I am currently doing a research internship at the Embedded Systems Lab at NCUE, Taiwan, in the field of artificial neural networks, and I have prior experience in projects related to AI and ANNs. My research paper, "AI Inspired ATC, Based on ANN and Using NLP", has been published by SAE International (the Society of Automotive Engineers, a United States-based, globally active professional association and standards-developing organization for engineering professionals in various industries). A few of my other works related to ANN projects and theory are also in the process of publication. Most of my experience with prior organizations is in Python and Java, along with C and C++, and I am also familiar with a variety of other software and project development tools.

I would also like to mention that I have been actively contributing to open source over the past few years, and I would like to work with Jina AI on creating my desired project. As my project can benefit greatly from Jina AI's set of tools, I believe that Jina AI is the right organization for me to choose for GSoC. I would really love to discuss the project described below in detail with someone from Jina AI.

Project Details:

Title: Multimodal Real-Time Personality Recognition System.
Summary: The project aims to develop a sophisticated system that can effectively identify and adapt to the varying personality traits of users. The system will analyze different types of user data, including speech, text, and visual cues, using Jina AI's advanced tools and frameworks. Through the use of machine learning techniques, the system will extract personality features from these multimodal inputs to create personalized user profiles. These profiles can then be used to tailor the system's responses and interactions with the user, resulting in a more customized and engaging experience. Overall, this project has the potential to revolutionize the way users interact with technology, making it more intuitive and personalized than ever before.
Expected outcomes: The expected outcome of this project is a functional prototype system that can accurately recognize and adapt to different personality traits of users based on their speech, text, and visual cues. This system can be useful for personalized recommendations, customer service, and other applications that require understanding of individual preferences.
Desired skills: To undertake this project, you should have a good understanding of Python, deep learning frameworks such as TensorFlow and Keras/PyTorch, and other relevant tools and libraries including Jina Ai's vast toolset. Experience in natural language processing and computer vision is also desirable.
Details:
- Skill needed: Python, deep learning frameworks, natural language processing, computer vision.
- Project size: 175 hours
- Difficulty level: Medium
- Mentor:
- Suggested by: Euhid Aman

I would really love to discuss this idea, learn more about it, and find out whether it can be included as a Jina AI project in Google Summer of Code.
My email is [email protected] for further contact.

Kind Regards,
Euhid Aman

JAX support in DocArray v2

Project idea 6: JAX support in DocArray v2

Info	details
Skills needed	Python, deep learning, JAX
Project size	175 hours
Difficulty level	Hard
Mentors	@Sami Jaghouar

Project Description

DocArray is a library for representing, sending, and storing multi-modal data, with a focus on applications in ML and Neural Search. It currently supports several deep learning frameworks, including PyTorch and TensorFlow. JAX is becoming increasingly popular for deep learning, so we want to integrate it into DocArray.
The project we propose is to add JAX as a backend for DocArray, alongside PyTorch and TensorFlow. The first part would involve implementing all of the computational backend functions of DocArray with the JAX framework. Then, we would battle-test the implementation against a real JAX use case, such as using DocArray with JAX support for model training and serving.

Expected outcomes

We aim to provide JAX with the same level of support in DocArray as we do for PyTorch, Numpy, and TensorFlow. The integration should be thoroughly tested and documented.

Desired skills

Python proficiency is expected since the DocArray codebase is quite complete. Additionally, experience with the JAX framework and familiarity with the scientific Python ecosystem (e.g. NumPy, Torch, scikit-learn, etc.) are required.

More details:

This project targets DocArray, specifically the current rewrite, DocArray v2, which is a new codebase.

We currently support three computational frameworks in DocArray v2: PyTorch, NumPy, and TensorFlow. We would like to add JAX support.

More info about JAX can be found here but in short, it is a deep learning framework supported by Google that is getting a lot of traction, especially among researchers.

Concretely what is expected in this project:

Add a new backend to our Computational Backend, relying as much as possible on jnp (jax.numpy), which is a NumPy-like interface for JAX. A similar approach can be found in the TensorFlow backend: https://github.com/docarray/docarray/blob/feat-rewrite-v2/docarray/computation/tensorflow_backend.py
Create new Tensor objects with the JAX backend. Example: ImageTensor will need a JAX variant (as will all of the other tensor types).
Make DocumentArrayStack compatible with JAX. Hopefully this will be straightforward, since it is meant to be agnostic of the computational backend, but we noticed some problems with the TensorFlow backend, so we can expect some friction here.
Battle-test the whole computational backend:
- Unit tests for each function in the computational backend
- Unit tests with the predefined tensors
- Unit tests with DocumentArrayStack
- Integration tests to check the coherence of the whole implementation
- Integration test on training a small NN with DocArray + JAX, similar to https://github.com/docarray/docarray/blob/feat-rewrite-v2/tests/integrations/array/test_torch_train.py
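
To make the first item more concrete, here is a minimal sketch of what a jnp-based backend could look like. The class name `JaxCompBackend` and its method names are assumptions chosen for illustration; they are not taken from the DocArray codebase, and the real backend interface may expose a different set of functions.

```python
import jax.numpy as jnp
import numpy as np


class JaxCompBackend:
    """Hypothetical backend; names are illustrative, not DocArray's API."""

    @staticmethod
    def stack(tensors):
        # Stack a list of equally shaped arrays along a new leading axis.
        return jnp.stack(tensors)

    @staticmethod
    def to_numpy(tensor):
        # Convert a JAX array into a NumPy array on the host.
        return np.asarray(tensor)

    @staticmethod
    def normalize(tensor, eps=1e-12):
        # L2-normalize along the last axis, guarding against zero vectors.
        norms = jnp.linalg.norm(tensor, axis=-1, keepdims=True)
        return tensor / jnp.maximum(norms, eps)

    @staticmethod
    def cosine_sim(x, y):
        # Pairwise cosine similarity between the rows of x and the rows of y.
        return JaxCompBackend.normalize(x) @ JaxCompBackend.normalize(y).T


queries = jnp.array([[1.0, 0.0], [1.0, 1.0]])
index = JaxCompBackend.stack([jnp.array([1.0, 0.0]), jnp.array([0.0, 1.0])])
sims = JaxCompBackend.to_numpy(JaxCompBackend.cosine_sim(queries, index))
```

Each function in such a backend is small enough to unit test in isolation, which matches the testing plan listed above.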

Build Executor (model) UI in Jina

Project idea 1: Build Executor (model) UI in jina

info	details
Skills needed	Python
Project size	175 hours
Difficulty level	Easy
Mentors	@Alaeddine Abdessalem, @Philip Vollet

Project Brief Description

Jina Executors are components that perform certain tasks and expose them as services using gRPC. Executors accept DocumentArrays as input and output. However, with DocArray v2 focusing on type annotations and enabling annotation of executor endpoints, it becomes possible for executors to describe their services and input/output in the same way as OpenAPI schemas. This allows us to offer built-in UIs for executors, enabling people to easily use their services with multi-modal data. The goal is to build this feature in Jina using Gradio.

Detailed Description

A Jina model UI should work like a Swagger UI in an HTTP framework.
A Swagger UI typically relies on an OpenAPI specification to learn about the services and their input/output schemas, and then generates the UI based on that.
However, Jina Executors rely on gRPC, and gRPC does not expose OpenAPI. This means we either have to implement an OpenAPI equivalent for the gRPC services that Executors expose, or adopt an existing standard (I like the KServe one).
Building such a service specification depends on knowing exactly the required input and output schemas.
Jina can now support defining input and output schemas of the endpoints using DocArray v2, which is in beta support.
The API looks like this: https://docs.jina.ai/concepts/executor/docarray-v2/.
This means that, assuming an Executor is built using DocArray v2 type annotations, we can generate service specifications in the same way FastAPI generates OpenAPI specifications based on pydantic type hints.
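
As a rough illustration of deriving a service specification from type hints, the sketch below uses only the Python standard library. Plain dataclasses stand in for DocArray v2 documents, `MyExecutorModel` is a plain class rather than a Jina Executor, and `schema_of` and `service_spec` are hypothetical helper names; the convention that the input argument is called `docs` is also an assumption.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Dict, List, get_args, get_origin, get_type_hints


@dataclass
class TextDoc:  # hypothetical input schema
    text: str


@dataclass
class EmbeddingDoc:  # hypothetical output schema
    text: str
    embedding: List[float]


class MyExecutorModel:  # stand-in for an Executor, not the Jina class
    def encode(self, docs: List[TextDoc]) -> List[EmbeddingDoc]:
        return [EmbeddingDoc(d.text, [float(len(d.text))]) for d in docs]


def schema_of(tp: Any) -> Dict[str, Any]:
    """Describe a type hint as a small JSON-like schema."""
    if get_origin(tp) is list:
        return {"type": "array", "items": schema_of(get_args(tp)[0])}
    if is_dataclass(tp):
        hints = get_type_hints(tp)
        props = {f.name: schema_of(hints[f.name]) for f in fields(tp)}
        return {"type": "object", "title": tp.__name__, "properties": props}
    names = {str: "string", int: "integer", float: "number", bool: "boolean"}
    return {"type": names.get(tp, "any")}


def service_spec(executor_cls: type) -> Dict[str, Any]:
    """Collect input/output schemas for every public annotated method."""
    spec = {}
    for name, fn in vars(executor_cls).items():
        if name.startswith("_") or not callable(fn):
            continue
        hints = get_type_hints(fn)
        if "docs" in hints and "return" in hints:
            spec[name] = {
                "input": schema_of(hints["docs"]),
                "output": schema_of(hints["return"]),
            }
    return spec


spec = service_spec(MyExecutorModel)
```

A UI layer could then walk `spec` and create one input widget per property, which is the same role an OpenAPI document plays for a Swagger UI.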

The next step would be to use this service specification endpoint to build a model UI.
This can be done using Gradio.
However, since we cannot serve a UI on the Executor's gRPC service, we can only host the UI at the gateway level.
Jina gateways recently added support for customization.
The new support allows building essentially any server based on any protocol.
This means a gateway can target an Executor within a Jina Flow and expose the UI for it.

I think the following API can make sense:

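# Option 1 (proposed, not an existing Jina API): a dedicated 'ui' protocol
from jina import Flow

flow = Flow(protocol='ui').add(uses=MyExecutorModel)

# Option 2 (proposed): a custom gateway; ModelUIGateway does not exist yet
from jina import Flow, ModelUIGateway

flow = Flow().config_gateway(uses=ModelUIGateway).add(uses=MyExecutorModel)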

In Jina, this would request the Executor's service specifications over gRPC and then dynamically build the Gradio interface.

To sum it up, the proposed technologies are not necessarily what we need to adopt, as long as we implement the following:

Define Executor service specifications (like OpenAPI specifications in HTTP) based on DocArray v2 type hints.
Leverage the service specifications to build a model UI gateway. The UI gateway is a custom Jina gateway and can use any framework (Gradio, Streamlit, or even any web framework).
Integrate the model UI gateway into Jina (offer a simple API to use it).

Expected outcomes

Submit one or more Pull Requests (PRs) to the Jina repository that provide a built-in UI for Executors. The UI can be built using Gradio and should be able to infer information about the Executor service using type annotations.

Improve Getting Started Documentation

Description:
The current Getting Started documentation could benefit from clarification and additional examples. This issue involves reviewing the existing documentation and making improvements to ensure that new contributors can easily set up the project and understand its structure.

Tasks:

Review the current Getting Started documentation.
Identify areas that may be unclear or lack detail.
Provide additional examples or code snippets where necessary.
Ensure that all installation steps are accurate and up-to-date.
Submit a pull request with the proposed changes.

Expand ANNLite capabilities with BM25 to build Hybrid Search

Project idea 4: Expand ANNLite capabilities with BM25 to build Hybrid Search

info	details
Skills needed	Python, C++, Lucene, ANN, Inverted Index
Project size	350 hours
Difficulty level	Hard
Mentors	@Felix Wang @Joan Martínez @Girish Chandrashekar

Project Description

In relation to the "Research about deploying LLM with Jina" project, another interesting approach would be to incorporate BM25 and Hybrid Search into ANNLite, which would enable Jina to build scalable Hybrid Search solutions in the cloud with a powerful default solution.
ANNLite is a vector search library developed by Jina that uses HNSW as its search algorithm. On top of this, it allows the filtering of Documents.
However, it can be important for the performance of search systems to be able to combine Vector Search algorithms with traditional text-search ones to get the best of both worlds.
This project is about evaluating and trying to apply Hybrid Search approaches on top of ANNLite.
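
To illustrate what combining the two kinds of search can mean, here is a small pure-Python sketch. It scores documents with BM25 (using the non-negative IDF variant), scores them again with a brute-force cosine similarity standing in for HNSW, and merges the two rankings with reciprocal rank fusion. None of this is ANNLite's API; the function names, the toy corpus, the made-up 2-d vectors, and the choice of reciprocal rank fusion as the merging strategy are all assumptions for the example.

```python
import math
from collections import Counter

docs = {
    "d1": "jina builds neural search services",
    "d2": "bm25 is a classic text ranking function",
    "d3": "hybrid search mixes bm25 with vector search",
}
# Toy 2-d embeddings standing in for real model output (made up for the demo).
vectors = {"d1": (0.9, 0.1), "d2": (0.1, 0.8), "d3": (0.6, 0.6)}


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with BM25."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized.values()) / n_docs
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            df = sum(1 for t in tokenized.values() if term in t)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores[doc_id] = score
    return scores


def cosine_scores(query_vec, vectors):
    """Exact cosine similarity; an ANN index would approximate this step."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return {doc_id: cos(query_vec, v) for doc_id, v in vectors.items()}


def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over all rankings."""
    fused = Counter()
    for scores in rankings:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, doc_id in enumerate(ordered, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


text_scores = bm25_scores("bm25 search", docs)
vector_scores = cosine_scores((0.7, 0.7), vectors)
hybrid_ranking = rrf([text_scores, vector_scores])
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on different scales; a weighted sum of normalized scores is another option the project could evaluate.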

Resources:

Expected outcomes

ANNLite is ready to be used as a default library to solve Hybrid Search applications.

Broken Slack Link

Description:
The current Slack link (https://slack.jina.ai/) appears to be broken or inaccessible. New contributors and community members may face difficulty joining the Slack channel to engage in discussions and seek help.

Tasks:

Verify the reported issue by attempting to access the Slack link.
Investigate and identify the cause of the broken link.
If the issue is related to a misconfiguration, update the link to the correct Slack workspace.
Test the updated link to ensure it redirects users to the appropriate Slack channel.

Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature

Project idea 5: Make ANNLite the go-to Vector Search library to be scaled by Jina using the StatefulExecutor feature

info	details
Skills needed	ANN, C++, Python, Databases
Project size	350 hours
Difficulty level	Hard
Mentors	@Felix Wang @Joan Martínez

Project Description

Jina is developing a stateful executor feature that enables Deployments with a state to be replicated and scaled. This opens the door to having a Vector Database in our ecosystem effectively and robustly. Iterating on ANNLite to act as the "Lucene" for Jina would be a great opportunity.

Expected outcomes

Build and validate an Executor in our Hub that uses ANNLite, or DocArray with ANNLite as a backend, to serve as the default vector database for all our examples with mid-sized data requirements.

More Info

In DocArray (v2) Document Stores (soon to be renamed to "Document Index"), we want to support multiple vector DBs and ANN libraries to give more options to the user
You can read more about DocArray here
And about Document Stores
But note that we are currently working on v2 of DocArray, which will be quite different. You can read more here
And for Document Stores in v2, see this PR: docarray/docarray#1124

Proposal: A virtual therapy AI system that utilizes speech-to-speech conversational AI and computer vision to aid individuals in managing mental health issues such as stress and anxiety.

Question about a potential GSoC project - AI Ethics (a research and/or implementation project)

Hello,

I hope you're doing well!

My name is Ana Blaž and I'm an undergraduate computer science student at the University of Ljubljana, Slovenia. Since starting university, I have worked in software engineering research at my university (my focus was applications for countryside digitalization and smart cities) and completed two product management internships (Amazon and Microsoft). I would love to start my long-term open-source contributing journey by working on a project within Jina AI. I mostly program in Java and Python, but I also have experience in JavaScript, C, PHP, and Dart. I'm very happy to learn any new languages and/or technologies; if needed, I would make sure to become sufficiently proficient before the start of the program.

I've been trying to find a project within GSoC that focuses on AI ethics/ethical AI, which is a big interest of mine, especially considering the last few months in the tech industry. Since you offer the option to "suggest your own project", I was wondering whether any projects could be designed to focus on that topic. Do you already happen to have any ethical AI strategies/policies in place/being implemented, or does anything within the topic align with your organization's strategy for the future? I would be interested in anything related to the topic, whether this is possibly doing research or implementing features.
If this is something relevant to Jina AI, I really would love to discuss the possibilities with you. Thank you for your time and consideration.

Using your template provided in the ideas list*:

Title: Helping Jina AI's users evaluate some of the ethical implications when using large language models (LLMs) for business purposes
Summary: Short Project Description - Researching an ethical concern that exists when using LLMs for business purposes. Using "Risk of leaking private data" as an example. This risk originates from the dataset containing said private data. Researching what kinds of private data were leaked in the past, including the ones mentioned in academic articles. Working on finding ways of identifying private data. Working on implementing a warning for the user that their dataset might contain private data (and maybe even what they could do to mitigate the risk).
Expected outcomes - Knowledge of the common types of private data that find their way into corpuses and how to identify them. A way to warn the user that they might be at risk.
Desired skills: research, software engineering
Details
- Skill needed: research, software engineering
- Project size: 175h
- Difficulty level: medium?
- Mentor: ?
- Suggested by: Ana Blaž

*I really am happy to discuss any other project that makes more sense for Jina AI's vision. I got the idea for this particular project from your blog when you were mentioning the importance of considering the ethical implications when using LLMs for business.

Kind regards,
Ana Blaž

Proposal: Multi-modal Getting Started Examples

Multi-modal Getting Started Examples

Summary:

At the moment, there is only one getting started example (Create App v3.14.1) which is known to be misleading and incomplete, as noted in #5602 and #5712. Creating new Getting Started examples, which could be easily understood by everyone, is the goal of this project. We can achieve this result by completing the following:

Deliverables
In the first part, we would design the user experience (UX) of the getting started examples from scratch. We should take care to make them accessible to everyone by defining an innovative communication approach, which includes different modalities (audio, visual, etc.) and accommodates different levels of expertise; the full set of requirements is shown as a table below.

Philosophy and Design Principles
Design Guidelines
Wireframe / Simple Prototype (Milestone)

In the second part, we would create a reusable template that should help us stay consistent with our design guidelines. This document must contain what is common to all getting started examples.

Create Document Template
Test Document Template
List of Getting Started Examples (Milestone)

Finally, we would craft the examples by adding contents and documentation to each.

Prepare Contents
Add documentation
Quality Assurance
(Optional) Reiterate

Expected outcomes

Create beginner examples to be added to Getting Started section of Jina documentation.

Desired skills

Python, Markdown, HTML, CSS, JavaScript

Details

Skill needed: Python, Markdown
Project size: 175h
Difficulty level: Medium
Mentor: ???
Suggested by: @giuliocn

Can I participate as an international student in the UK

Hello, I am an international student studying for a Master's in Computer Science in the UK. This project is closely related to my area of research. Can I take part in this program? GSoC's website is a bit vague about my specific situation. I have a legal right to work in the UK, but my visa allows me to work 20 hours, and 40 hours during holidays. I would like some clarification from the team before submitting my proposal. Thank you.

Research about deploying LLM with Jina

Project idea 3: Research about deploying LLM with Jina

info	details
Skills needed	Python, Pytorch, CUDA, docker, Kubernetes
Project size	350 hours
Difficulty level	Hard
Mentors	@Alaeddine Abdessalem, @Joan Martínez

Project Description

Recently, large language models (LLMs) have gained attention for their ability to generate text to solve various tasks, such as question-answering, reading comprehension, and coding. However, most of these models are quite large, and deploying them requires certain technologies to be in place to enable scalability when using GPU resources.
These technologies include:

model optimizer and weights partitioning across multiple GPU devices, whether within the same node or across different nodes.
weight offloading
model compression
optimizing the model for the underlying hardware
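
As a toy illustration of the first item, the sketch below assigns contiguous model layers to devices so that no device exceeds its memory budget. It is a deliberately simplified picture under stated assumptions: the function name `partition_layers`, the per-layer sizes, and the fill-in-order strategy are all made up for the example and do not describe how any particular library partitions weights.

```python
from typing import Dict, List


def partition_layers(
    layer_sizes_gb: List[float], device_capacity_gb: float, num_devices: int
) -> Dict[int, List[int]]:
    """Assign contiguous layers to devices, filling each device in order."""
    placement: Dict[int, List[int]] = {d: [] for d in range(num_devices)}
    device, used = 0, 0.0
    for layer, size in enumerate(layer_sizes_gb):
        if size > device_capacity_gb:
            raise ValueError(f"layer {layer} does not fit on a single device")
        if used + size > device_capacity_gb:
            # Current device is full: move on to the next one.
            device, used = device + 1, 0.0
            if device == num_devices:
                raise ValueError("model does not fit on the available devices")
        placement[device].append(layer)
        used += size
    return placement


# Five layers (sizes in GB, made up) split across two 10 GB devices.
placement = partition_layers(
    [4.0, 4.0, 4.0, 4.0, 2.0], device_capacity_gb=10.0, num_devices=2
)
```

When the second `ValueError` is raised, the model cannot be held in device memory as is, which is the situation that weight offloading and model compression are meant to address.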

Several libraries apply these technologies to LLMs to ease deployment, for instance DeepSpeed, Accelerate, and FlexGen.

We aim to assess the capability of deploying such models with Jina and explore what integrations we can build with the existing ecosystem to enable LLM inference using the Jina stack.

The idea is to build demos/showcases with these technologies to host an LLM using Jina. If for some reason these libraries cannot be used within the Jina framework, we would build integrations to make these technologies usable within Jina.

Expected outcomes

The project aims to demonstrate the capability of Jina to deploy and scale LLMs and build generative applications in a cost-efficient manner. Specific outcomes include:

Implementation of LLM deployment using Jina and assessing scalability with GPU resources
Documentation and example code demonstrating the use of Jina for LLM deployment and inference
Building integrations with the mentioned libraries in order to use them within Jina
Evaluation of the cost-efficiency of deploying and scaling LLMs with Jina compared to other technologies

DocArray wrap ANN libraries

Project idea 2: DocArray wrap ANN libraries

Info	details
Skills needed	Python, ANN Search experience
Project size	175 hours
Difficulty level	Medium
Mentors	@Johannes Messner, @Sami Jaghouar, @Philip Vollet

Project Description

In DocArray, we have been concentrating on developing production-ready Vector DBs for large-scale searches. However, there are many ANN libraries without scalability layers that can be integrated into DocArray, making it accessible to academia and production teams with small-to-medium amounts of data, without the need for external services.
DocArray v2 will have a concept called Document Index. This is an abstraction that lets a user store their Documents (on disk or in a database) and retrieve them using ANN search. As such, there can be multiple Document Indexes backed by different backends (Elastic, Qdrant, Weaviate, ...), all following the same basic API.
The idea behind this project is to take an ANN library and use it to implement a Document Index. There is already an implementation using HNSWLib that you can find here: docarray/docarray#1124. But there is room to create similar backends using other libraries: Annoy, Faiss, ... The goal is to provide user choice.
If there is interest, someone could also implement a backend using a vector database. We already have Qdrant, Weaviate, and Elastic covered, but Milvus, Redis, and some others could also be interesting. You can find a design doc for Document Index here.
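
The "same basic API, different backends" idea can be sketched as an abstract base class with one concrete backend. The names `BaseDocumentIndex`, `BruteForceIndex`, `index`, and `find` are assumptions made for this example and are not the actual DocArray v2 interface (see the linked PR and design doc for that). The backend here does an exact linear scan; a real backend would delegate to a library such as Annoy, Faiss, or HNSWLib.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Tuple


class BaseDocumentIndex(ABC):
    """Hypothetical common interface; not the actual DocArray v2 API."""

    @abstractmethod
    def index(self, doc_id: str, embedding: Sequence[float]) -> None:
        """Store one document embedding under the given id."""

    @abstractmethod
    def find(self, query: Sequence[float], limit: int) -> List[Tuple[str, float]]:
        """Return up to `limit` (doc_id, distance) pairs, nearest first."""


class BruteForceIndex(BaseDocumentIndex):
    """Exact search backend; an ANN library would replace the linear scan."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, ...]] = {}

    def index(self, doc_id: str, embedding: Sequence[float]) -> None:
        self._store[doc_id] = tuple(embedding)

    def find(self, query: Sequence[float], limit: int) -> List[Tuple[str, float]]:
        # Euclidean distance from the query to every stored embedding.
        scored = [(doc_id, math.dist(query, emb)) for doc_id, emb in self._store.items()]
        return sorted(scored, key=lambda pair: pair[1])[:limit]


idx = BruteForceIndex()
idx.index("a", [0.0, 0.0])
idx.index("b", [1.0, 1.0])
idx.index("c", [3.0, 4.0])
nearest = idx.find([0.9, 1.2], limit=2)
```

Because callers depend only on the base class, swapping the linear scan for an approximate index changes speed and recall but not the calling code.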

Expected outcomes

We have a set of DocStore implementations in DocArray that support the most popular ANN libraries, such as FAISS, Annoy, and Hnswlib.
