luvnft / podsearch

Search engine for podcast transcriptions with accompanying YouTube videos, synced and aligned. Nuxt 3 frontend; Express/Prisma/pm2/MySQL backend. Uses WhisperX, wav2vec, feedparser, ffmpeg, Meilisearch, GitHub Actions, bash scripts with SSH for CI/CD, and Tailwind CSS with tailwind-bootstrap-grid.

Home Page: https://podsearch.vercel.app




Poddley

"Shazam" for Podcasts


Link: poddley.com


🎙️ About Poddley

Poddley is the "Shazam for podcasts," an innovative service designed to be a comprehensive search engine for podcast transcriptions. It provides a nexus between various podcast resources, including YouTube, Apple Podcasts, transcripts, time-stamped quotes, RSS feeds, homepages, and episode links. More than just a search tool, Poddley is a discovery platform, empowering users to unearth and enjoy a vast array of podcast content based on the spoken word.

🔍 Main Features

  • 📜 Transcription Search: Instantly find podcasts through searchable transcriptions.
  • 🔗 Resource Mapping: Seamless integration with platforms like YouTube and Apple Podcasts.
  • 🕓 Time-Location Quotes: Locate the exact moment a phrase was spoken in a podcast episode.
  • 🌐 Discovery Platform: Explore new podcasts with ease, tailored to your interests.

💡 Vision

The vision for Poddley is to transform how people discover and interact with podcast content. By bridging the gap between various podcasting platforms and enriching the search experience, Poddley aspires to be the go-to resource for podcast enthusiasts and casual listeners alike.


Project

Link to Poddley

Project duration (approx.)

24.11.2022 - 23.10.2023

Build status

CI/CD Pipeline

Design iterations

Realizations

  • Hydration mismatches when using Cloudflare as a proxy are caused by Cloudflare's automatic HTML, CSS, and JS minification plus the DNS proxy. Disabling it makes the hydration blinking disappear.
  • Don't optimize too early
  • Too much caching is bad
  • The API debounce delay should be (human reaction time in ms - API latency).
  • No amount of backend optimization will save you from a long TTFB (Time To First Byte). After spending a week optimizing the backend and testing Vercel and Netlify (Pro and Free tiers) trying to get the speed index below 2 seconds, I finally tried Cloudflare and went straight to 1.2 seconds.
  • Unused CSS and third-party script/services can be tricky to deal with.
  • Lazy-loading is nice.
  • Assets compression:
    • Decided to use Brotli compression to improve transfer time. Since Brotli was designed to compress streams on the fly, it is faster than gzip at both compressing content on the server and decompressing it in the browser; in some cases overall front-end decompression is up to 64% faster than gzip. Source
    • Only using WebP on the website, due to faster loading and better compression during transfer. Source
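The debounce rule of thumb in the realizations above can be sketched as follows (the constants and function names are illustrative, not taken from the project):

```javascript
// Sketch of the rule: debounce delay = human reaction time - API latency.
// The slower the API, the less room there is to debounce before the UI lags.
function debounceDelay(reactionTimeMs, apiLatencyMs) {
  // Clamp at zero: if the API is slower than the reaction time, don't debounce.
  return Math.max(0, reactionTimeMs - apiLatencyMs);
}

// Generic trailing-edge debounce using the computed delay.
function debounce(fn, delayMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Example: ~250 ms reaction time, ~80 ms observed API latency.
const delay = debounceDelay(250, 80); // 170 ms
const searchDebounced = debounce((q) => console.log("searching:", q), delay);
```

With a hypothetical 250 ms reaction time and 80 ms API latency, user input is debounced for 170 ms; a 200 ms API against a 100 ms reaction time yields 0, i.e. no debouncing at all.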

Frontend:

  • Nuxt 3 for client-stuff:
    • SSR enabled
    • Server-structure (as website can't be generated statically due to dynamic route-params)
    • Nitro-server with gzip- and brotli-compression and minified assets
  • JSON-to-TypeScript type generation based on API responses
  • TypeScript.
  • Cloudflare as the DNS manager, for easier setup and automatic image resizing.
  • Tracking: Plausible; switched to Cloudflare as it was free
  • ServiceWorker for offloading the main thread from the frequent API calls to the backend API. There are multiple ways to solve this: throttling + debouncing on user input (during instant search) is a possibility, but it often causes a laggy UI and murky logic. Offloading it all to a ServiceWorker gave much better results, despite being tricky to implement.
  • Nuxt 3 modules used:
    • TailwindCSS module (integrated PurgeCSS and fast design development)
    • NuxtImage module
    • HeadlessUI module
    • SVG-sprite-module (for reducing SVG-requests to server)
    • Nuxt Delayed Hydration module (for improved Lighthouse score and loading time)
    • Lodash module (for _Debounce-function)
    • Device module (for iPhone-device detection)
    • Pinia Nuxt Module (for global storage across components)
    • Nightwind Tailwind plugin (for automatically generated Tailwind classes for night mode)
    • Nuxt Image Cloudflare (passes the correct image width and height to the Cloudflare image URL, causing Cloudflare to automatically resize and compress the image into the smallest payload)
    • Floating UI + Nuxt 3 Headless UI, for automatic flipping of UI elements when outside the viewport
    • Nuxt 3 Google Fonts module, for async fetching of Google Fonts (probably a GDPR breach, but...)
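One ingredient of the ServiceWorker offload described above is making sure a stale search response never overwrites a newer one. A minimal "latest query wins" guard might look like this (`makeLatestWins` is an illustrative name, and the worker/message plumbing is omitted; this is a sketch, not the project's actual code):

```javascript
// Each outgoing search request gets a fresh id; a response is only rendered
// if its id still matches the newest request when it arrives.
function makeLatestWins() {
  let current = 0;
  return {
    // Stamp an outgoing request.
    next: () => ++current,
    // Check whether a completed request is still the latest one.
    isCurrent: (id) => id === current,
  };
}

// Usage sketch: stamp before posting to the worker, check before rendering.
const guard = makeLatestWins();
function onUserInput(query, postToWorker, render) {
  const id = guard.next();
  postToWorker(query, (result) => {
    if (guard.isCurrent(id)) render(result); // drop stale responses
  });
}
```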

Backend:

Services:

The services run primarily as pm2 processes, with daemon auto-restart on server shutdown. They are:

  • Express-API: API that queries the meilisearch instance.
  • Route-Controller-Service architecture for ExpressJS/Node-backends. Architecture
  • Indexer (runs every 30min): Updates meilisearch indexes based on db-data
  • RSS-updater (runs every 30 min): Updates the db (upsert) based on changes in the RSS feeds
  • Transcriber/YoutubeGetter (runs continuously) (can be run concurrently due to db-row locking)
  • Meilisearch-instance (native rust): Does the full-text search functionality
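The Route-Controller-Service split named above can be sketched roughly like this (all names here are hypothetical, and `index` stands in for a Meilisearch index client rather than a real connection):

```javascript
// Service layer: owns the shape of the query sent to Meilisearch.
function makeSearchService(index) {
  return {
    search: (query) =>
      index.search(query, { limit: 10, attributesToHighlight: ["text"] }),
  };
}

// Controller layer: translates HTTP concerns to and from the service.
function makeSearchController(service) {
  return {
    search: async (req, res) => res.json(await service.search(req.query.q ?? "")),
  };
}

// A route file would then wire it up, e.g.:
//   router.get("/search", controller.search);
```

The point of the layering is that the controller never knows about Meilisearch and the service never knows about Express, which keeps both easy to test with fakes.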

Architecture:


Overall Architecture

Location | Specifications | Responsibilities
Server 1 | Very low-end DigitalOcean droplet with 2 vCPUs and 4 GB of RAM | Runs the database consistently (needed by the transcriber and indexer; it can't be allowed to shut down sporadically)
Server 2 | Higher-end, 8 vCPUs and 16 GB of RAM | Runs the Express API, the Nuxt 3 client code (planned), the Meilisearch search engine, the indexer, and the RSS-updater
Local PC | ASUS GeForce RTX 3060 DUAL OC | Runs the Transcriber
Pm2Setup Pm2Diagram
Meilisearch pm2 config

A Meilisearch instance runs with the following settings; all indexes use the default settings except where overridden in the backend scripts.

	module.exports = {
	  apps: [
	    {
	      name: "meilisearch",
	      script: `./meilisearch --no-analytics`,
	      env: {
	        MEILI_HTTP_ADDR: "0.0.0.0:7700",
	        MEILI_MASTER_KEY: "some key here",
	      },
	    },
	  ],
	};
Indexes
All Meilisearch Indexes
	{
	    "results": [
	        {
	            "uid": "episodes",
	            "createdAt": "2023-07-16T09:34:08.17973422Z",
	            "updatedAt": "2023-09-23T11:11:03.374123595Z",
	            "primaryKey": "id"
	        },
	        {
	            "uid": "podcasts",
	            "createdAt": "2023-07-16T09:34:08.151876845Z",
	            "updatedAt": "2023-09-23T11:39:47.029653816Z",
	            "primaryKey": "id"
	        },
	        {
	            "uid": "segments",
	            "createdAt": "2023-07-16T09:34:08.099810144Z",
	            "updatedAt": "2023-10-01T15:16:28.788349424Z",
	            "primaryKey": "id"
	        },
	        {
	            "uid": "transcriptions",
	            "createdAt": "2023-07-16T09:34:08.05460293Z",
	            "updatedAt": "2023-09-23T11:08:48.119600807Z",
	            "primaryKey": "id"
	        }
	    ],
	    "offset": 0,
	    "limit": 20,
	    "total": 4
	}
Podcasts:
Podcasts Index Settings
	{
	    "displayedAttributes": [
	        "*"
	    ],
	    "searchableAttributes": [
	        "*"
	    ],
	    "filterableAttributes": [
	        "podcastGuid"
	    ],
	    "sortableAttributes": [],
	    "rankingRules": [
	        "words",
	        "typo",
	        "proximity",
	        "attribute",
	        "sort",
	        "exactness"
	    ],
	    "stopWords": [],
	    "synonyms": {},
	    "distinctAttribute": null,
	    "typoTolerance": {
	        "enabled": true,
	        "minWordSizeForTypos": {
	            "oneTypo": 5,
	            "twoTypos": 9
	        },
	        "disableOnWords": [],
	        "disableOnAttributes": []
	    },
	    "faceting": {
	        "maxValuesPerFacet": 100,
	        "sortFacetValuesBy": {
	            "*": "alpha"
	        }
	    },
	    "pagination": {
	        "maxTotalHits": 1000
	    }
	}
Episodes:
Episodes Index Settings
	{
	    "displayedAttributes": [
	        "*"
	    ],
	    "searchableAttributes": [
	        "*"
	    ],
	    "filterableAttributes": [
	        "episodeGuid"
	    ],
	    "sortableAttributes": [
	        "addedDate"
	    ],
	    "rankingRules": [
	        "words",
	        "typo",
	        "proximity",
	        "attribute",
	        "sort",
	        "exactness"
	    ],
	    "stopWords": [],
	    "synonyms": {},
	    "distinctAttribute": null,
	    "typoTolerance": {
	        "enabled": true,
	        "minWordSizeForTypos": {
	            "oneTypo": 5,
	            "twoTypos": 9
	        },
	        "disableOnWords": [],
	        "disableOnAttributes": []
	    },
	    "faceting": {
	        "maxValuesPerFacet": 100,
	        "sortFacetValuesBy": {
	            "*": "alpha"
	        }
	    },
	    "pagination": {
	        "maxTotalHits": 1000
	    }
	}
Transcriptions:
Transcriptions Index Settings
	{
	    "displayedAttributes": [
	        "*"
	    ],
	    "searchableAttributes": [
	        "transcription"
	    ],
	    "filterableAttributes": [],
	    "sortableAttributes": [],
	    "rankingRules": [
	        "exactness",
	        "proximity",
	        "typo",
	        "words"
	    ],
	    "stopWords": [],
	    "synonyms": {},
	    "distinctAttribute": null,
	    "typoTolerance": {
	        "enabled": true,
	        "minWordSizeForTypos": {
	            "oneTypo": 5,
	            "twoTypos": 9
	        },
	        "disableOnWords": [],
	        "disableOnAttributes": []
	    },
	    "faceting": {
	        "maxValuesPerFacet": 100,
	        "sortFacetValuesBy": {
	            "*": "alpha"
	        }
	    },
	    "pagination": {
	        "maxTotalHits": 1000
	    }
	}
Segments:
Segments Index Settings
	{
	    "displayedAttributes": [
	        "*"
	    ],
	    "searchableAttributes": [
	        "text"
	    ],
	    "filterableAttributes": [
	        "belongsToEpisodeGuid",
	        "belongsToPodcastGuid",
	        "belongsToTranscriptGuid",
	        "end",
	        "id",
	        "start"
	    ],
	    "sortableAttributes": [
	        "start"
	    ],
	    "rankingRules": [
	        "exactness",
	        "sort",
	        "proximity",
	        "typo",
	        "words"
	    ],
	    "stopWords": [],
	    "synonyms": {},
	    "distinctAttribute": null,
	    "typoTolerance": {
	        "enabled": true,
	        "minWordSizeForTypos": {
	            "oneTypo": 5,
	            "twoTypos": 9
	        },
	        "disableOnWords": [],
	        "disableOnAttributes": []
	    },
	    "faceting": {
	        "maxValuesPerFacet": 200,
	        "sortFacetValuesBy": {
	            "*": "alpha"
	        }
	    },
	    "pagination": {
	        "maxTotalHits": 5000
	    }
	}
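Given the segments settings above, a time-located quote search could be assembled like this (`buildSegmentQuery` is a hypothetical helper; the attribute names `belongsToEpisodeGuid` and `start` come from the filterable/sortable attributes shown, while the limit is arbitrary):

```javascript
// Build the Meilisearch query params for "find the moment a phrase was
// spoken in one episode", using the segments index settings above.
function buildSegmentQuery(phrase, episodeGuid) {
  return {
    q: phrase,
    // belongsToEpisodeGuid is listed under filterableAttributes.
    filter: `belongsToEpisodeGuid = "${episodeGuid}"`,
    // "start" is sortable, so hits come back in playback order.
    sort: ["start:asc"],
    limit: 20,
  };
}
```

These params would be passed to a segments-index search; each hit then carries `start`/`end` timestamps that can be mapped onto the synced YouTube video.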

Transcriber/Re-alignment-service

  • The transcriber is a Python script that grabs a selection of podcast names from a JSON file.
  • It queries a SQLite database downloaded daily from PodcastIndex.com.
  • It uses feedparser to get episode names, audio files, titles, etc. from the RSS feeds for further parsing.
  • It uses WhisperX to transcribe and align the segments. This implementation is better than the original Whisper because it uses faster-whisper under the hood, which supports batching among other performance improvements.
  • It then uses WhisperX to re-align the timestamps with the audio file (using the large wav2vec model).
  • Finally, it finds the YouTube video that matches the audio file and updates the episode in the database.

Nginx settings:

NginxSetup NginxDiagram
Nginx Settings

For meilisearch.poddley.com

	server {
	    listen 80;
	    server_name meilisearch.poddley.com;
	
	    # Add this line to increase max upload size
	    client_max_body_size 30M;
	
	    location / {
		proxy_pass http://localhost:7700;
		proxy_http_version 1.1;
		proxy_set_header Upgrade $http_upgrade;
		proxy_set_header Connection 'upgrade';
		proxy_set_header Host $host;
		proxy_cache_bypass $http_upgrade;
	    }
	
	    listen 443 ssl;
	    ssl_certificate /etc/letsencrypt/live/meilisearch.poddley.com/fullchain.pem;
	    ssl_certificate_key /etc/letsencrypt/live/meilisearch.poddley.com/privkey.pem;
	    include /etc/letsencrypt/options-ssl-nginx.conf;
	    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;
	}
	
	server {
	    listen 80;
	    server_name poddley.com;
	
	    location / {
		proxy_pass http://localhost:3001;
		proxy_http_version 1.1;
		proxy_set_header Upgrade $http_upgrade;
		proxy_set_header Connection 'upgrade';
		proxy_set_header Host $host;
		proxy_cache_bypass $http_upgrade;
	    }
	
	    listen 443 ssl; # managed by Certbot
	    ssl_certificate /etc/letsencrypt/live/poddley.com/fullchain.pem; # managed by Certbot
	    ssl_certificate_key /etc/letsencrypt/live/poddley.com/privkey.pem; # managed by Certbot
	    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
	    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
	}
	
	server {
	    listen 80;
	    server_name api.poddley.com;
	
	    # Add this line to increase max upload size
	    client_max_body_size 30M;
	
	    location / {
		proxy_pass http://localhost:3000;
		proxy_http_version 1.1;
		proxy_set_header Upgrade $http_upgrade;
		proxy_set_header Connection 'upgrade';
		proxy_set_header Host $host;
		proxy_cache_bypass $http_upgrade;
	    }
	
	    listen 443 ssl;
	    ssl_certificate /etc/letsencrypt/live/api.poddley.com/fullchain.pem;
	    ssl_certificate_key /etc/letsencrypt/live/api.poddley.com/privkey.pem;
	    include /etc/letsencrypt/options-ssl-nginx.conf;
	    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem;
	}

Other

  • HTTPS everywhere, done with Let's Encrypt (free HTTPS certificates)

AI services

  • All AI services run 24/7 on an ASUS GeForce RTX 3060 Dual OC V2
  • I used to run tests on runpod.io due to its cheap prices, but quickly realized that long-term use would become expensive. Paperspace was even more expensive, and Deepgram was ridiculously expensive.
  • The AI models initially ran on my local computer with a GTX 1650, but it crashed frequently and had insufficient GPU memory (it would terminate sporadically). I also tried running an RTX 3060 as an eGPU, connected through the M.2 NVMe slot of my Lenovo Legion 5 AMD gaming laptop via ADT-Link. That was deeply unsuccessful due to frequent crashes. Since all of these were unsatisfactory, I splurged on a workstation in the end.

Stuff to do:

TODO
  • Convert the play button to say "Play podcast"
  • Convert the insertionToDb on the TranscriptionService to JavaScript to make use of the $transaction functionality, which is only available in the JavaScript client (unlike the python-prisma-client port), and enable multiple GPUs to process transcriptions at the same time.
  • Joe Rogan recorder (illegal)
  • Download all podcast audio files to our own server for long-term support (probably illegal)
  • Use the Spotify audio player for the Spotify podcasts (pointless)
  • Create a Spotify recorder (it's illegal)
  • Try Workers on Cloudflare just in case, with the Nitro template (using Cloudflare Images and Pages)
  • We need 3 tabs that show at the top like the buttons: - [ ] New episodes - [ ] Trending episodes - [ ] Info
  • Add embedded YouTube to the episode area??? Is it even possible, and should it also be time-based?
  • Add info component (part of three tabs task)
  • Add top 3 most searched podcasts (part of tabs)
  • Switch to using the puppeteer script over time (meaningless)
  • The record button should be where the donation button is now
  • Implement https://github.com/GoogleChromeLabs/quicklink and Instant.page
  • Set up CI/CD pipeline for backend (here: https://medium.com/@fredrik.burmester/nuxt-3-website-with-cloudflare-workers-and-github-actions-336411530aa1)
  • Make the desktop mode
  • Make the iPhone mode
  • Make the Android mode
  • Add faceting for episode and podcast (routes for episode and podcast)
  • Make the types shared between backend and client
  • Add audio denoising on the backend to clean up Safari audio recordings, which are very muddy due to Safari being restrictive. (Implemented primarily to address Safari's horrible audio-recording issues, but it didn't really solve much: the audio was denoised and clean, but the muffled speech was unavoidable. This was the main reason to create an app.)
  • Disable-animation button (nah)
  • It should be possible to paste a link to a YouTube/TikTok/whatever page and get an answer as to which episode it comes from
  • Add word-by-word highlighting during playback. The animate.css text should be one-liners only, so it needs to be split; for this to work we need the word-token times, which aren't implemented yet.
  • Download all podcasts (should be)...
  • Move navbar to bottom (bad idea, so no)
  • Maybe use a different audio player that is better at fetching metadata???
  • Use the Firefox colors as the Nuxt progress bar color and find out why the loading indicator doesn't work. (This is meaningless, as the NuxtLoadingIndicator is only present in SPAs, not SSR apps.)
  • Fix so the home button doesn't wait for the api response but rather has same behaviour as NavLogo
  • Add a navbar at the top showing how many podcasts/episodes have been transcribed + current listeners
  • Increase zoom further to 25% or 10%?? (not necessary, enough screen hogging)
  • Binary tree subs cache object needs to be available (unnecessary)
  • Add dark mode icon to the button
  • Logo color?
  • Add share buttons to moreLink
  • Set the default to always be darkMode (nah, use the system setting)
  • Move the subs into the modded player (no subs)
  • Grab Anthony's transcriptions for Lex to save time (no, he doesn't have the necessary info)
  • Add a check that blocks multiple audio players from playing simultaneously + make the audioPlayer not visible until someone starts playing
  • Switch the audioPlayer to howler.js with the desired design
  • When someone clicks the button, it should set the global playing value to true and set the link of the audioLink being played. If the audioPlayer has that enclosure it should continue playing; if not, it shan't. The audioPlayer should also show up if it has a chosen audio file; if not, it shan't be shown.
  • Dockerize the entire product in one server container (Cloudflare Workers, it's cheaper than DigitalOcean, CI/CD too). This means client + api/backend + indexer; the transcriber stays on my own setup. (Not possible, due to Cloudflare not supporting custom DBs or an Express backend.)
  • Convert search to multisearch to speed up search time
  • Enable subtitles on all iframes (the YouTube API doesn't support/allow this.)
  • Create ElasticSearch full text search engine, switch to it from MeiliSearch
  • When you click on a section of the text it should start playing.
  • Make dropdown lighter in color on darkMode
  • Fine-tune the search functionality based on phrase-searching and typo-tolerance
  • Fix the indexing bug that causes the entire database to be re-indexed each time
  • Add loading spinner only for firefox/chrome-based audioPlayers
  • Switch from OpenAI Whisper to https://github.com/guillaumekln/faster-whisper: a 4x increase in speed 🤩🤩🤩, and that is with the Large-v2 model
  • Switch to this alignment model: jonatasgrosman/wav2vec2-large-xlsr-53-english
  • Stick to Meilisearch, but use the new matchingStrategy functionality and see if the phrase searching can be adapted to accommodate our needs
  • Use Meilisearch for rough searching, then use n-gram(3) with Jaccard string similarity for the fine-tuned search
  • Fix the 3x instance return from the API
  • Add second-hand jaccard highlighting
  • Downscale the pictures retrieved and store them on local
  • Adjust python transcription script for optimistic concurrency
  • Reindex everything prior to April 16.
  • Index everything into Meilisearch again.
  • Check whether free Vercel is better than free Netlify
  • Fix the segment-area design
  • Add the nginx info, the architecture, etc.
  • Add a background to the selected button and routes to the buttons
  • Fix the loading spinner bug
  • Remove bold tab bar
  • Add loading spinner to the audio player as in v-if
  • Add a loading state for audio while it loads
  • Add segment id
  • Fix navbar
  • Detach logic of searching from navbar
  • Fix unique segment issue by modifying triplet
  • Refactor the returning logic backend
  • Index the rest of the database
  • Add a loading indicator while the API call runs
  • Turn the search icon into an "x" when the search is open
  • Use a dedicated CPU for the API server
  • Use lowest db cpu
  • Possible idea: nuxt generate all static files => serve on bunnyCDN all as static => create CI/CD pipeline to bunnyCDN, kinda want to avoid cloudflare tbh
  • Write me an about page contact page and donate page
  • Disable plausible, netlify, vercel and images.poddley.com. Cloudflare literally does all that for free..
  • Login/Sign-up functionality.
  • Setup up multisearch for the search-service on the backend. Should give some slight performance benefits
  • Don't have debouncing on client side, but do have throttling + cancellable promises
  • Add helmet and add rate-limiting
  • Light refactoring of backend and frontend to support SearchQuery and filter/sort parameters + refactor the ServiceWorker + change all APIs to POST requests.
  • Find out why worker is slow on new backend. Json parsing? Filter setting on the meilisearch api?
  • Fix the buttons
  • Segments have to move
  • Add segment search functionality route so it can be shared.
  • Search button index redirect
  • Time location needs to update if livesubs are enabled.
  • Livesubs button is needed
  • Drop usage of hq720
  • Start delayed hydration again
  • Fix donation page
  • Fix nav buttons
  • Fix the layout shifts during image download time…
  • Add esc listener to non-headless ui stuff
  • Use vueUse instead of v-click-outside
  • Increase the margin on the search box and the marginBottom
  • Set the logo to be a NuxtLink, not an href
  • Should only blink when playback starts
  • Audio to text transformation search
  • Add dark mode... toggle button + functionality.
  • Fix device issue
  • Fix the home button not reflecting the same behaviour as the NavLogo
  • Add animation to the text-changing section
  • Remove superfluous response data from the API
  • Finish the response
  • Improve dark mode colors
  • Fix the black background
  • Remove superfluous junk text from the response
  • Remove the profile and push the microphone there
  • Fix extreme build time due to tailwind
  • Fix SSR dark mode not being preloaded => (found a hack, contribute to the repo?)
  • Convert entire Nuxt3 app to an iOS/Android app using capacitor https://dev.to/daiquiri_team/how-to-create-android-and-ios-apps-from-the-nuxtjs-application-using-capacitorjs-134h (was possible to implement)
  • Fix the favicons to look good, and to support darkmode
  • Remove useDevice as we have no use for it on the web-version
  • Review the about and contact pages
  • Make emails
  • Fix darkMode doubleClick bug
  • Fix issue with description meta-tag on poddley
  • Add "x"-cleanSearchString functionality like google
  • Subtitles toast enabled/disabled
  • Fix spinner wiggle bug
  • Youtube DarkMode button
  • Align RemoveButtons
  • Remove shade and border on audioPlayer on iPhones
  • Move expand more text all the way to the right even when text is short
  • The MoreLinks button has disappeared; fix it
  • Fix hover fill on rss icon
  • Add icons to burgerMenu
  • Fix the hr bug on the about page
  • Gray-100 text sb
  • Fix darkmode issue on audioplayer
  • Fix wrong reload thing on navlogo
  • Fix size issue on triple dot menu
  • Expand icon padding
  • Fix toast-issues
  • Return 3 ascending from the query, pick the last, not the full set; remove expand
  • Add a custom TailwindUI toast engine/service
  • Refactor the transcriber
  • Fix the desktop search bar
  • Microphone functionality on phones will redirect to the app to drive users there + better audio quality, which is needed for the transcriptions.
  • Halve the amount of text returned and the height of the textField component
  • Use the native audio player as far as you can, embedded into each entry, if someone clicks the play button.
  • Style all of them accordingly: one way for Chrome, another for Firefox, and another for Opera.
  • Are the about page and contact page good looking, different font?
  • Start up all pm2-services.
  • Fix height of text-area
  • More padding on the navbar on the mobile phones.
  • Add the "getEntireTranscript"-button.
  • Polish the GitHub readme
  • Fix the scrollIntoView bug
  • Backend refactor for speed improvement
  • Fix the top padding on the contact and about pages
  • Remove all the gunk from the segments; keep the episode and podcast data only on the Hit object, not on the subhits, and don't include any values besides the necessary ones on the subhits. That should reduce the payload substantially.
  • Adapt the API response to only 5 elements on phones, as phones have poor specs
  • Fix the search functionality to use jaccard again
  • Fix sepia color on audioplayer
  • Make MeiliSearch production version.
  • Some podcast streams might have ads, both when being played and when being transcribed. Possible solutions: save the podcast (legal???), or don't save it (legal, but lame, because the ads mess with the timestamps). A mini-model as a web-audio ad killer??? How do we handle ads being part of some podcast transcriptions, like Logan Paul's? AI to remove them from the audio? One option: download the podcast, remove the ads using AI, save the audio, and stream it from our own R2 bucket. PiHole?? An ad-free VPN? Current AI audio ad removers suck. Find the equivalent on Apple Podcasts and use that instead of the audio stream?? ANSWER: until we develop an audio-AI ad-removal model, we will just remove them...
  • Can we move backend and client closer together? No.
  • Use <Lazy everywhere, using useElementVisibility from vueUse + currentPlayingTime as markers for DOM mounting
  • Upload sanitized repo stripped of all private-api-keys etc...
  • Start up all the pm2 instances in accordance with the written architecture (further up)
  • Double-check that no API keys have been leaked (GitGuardian will probably be used)
  • On full load, the IntersectionObserver goes berserk; gotta tweak that a bit, I think...
  • Also remove "acast"-source podcasts from podcasts.json
  • Integrate CI/CD somehow when everything is on DigitalOcean for client
  • Integrate CI/CD somehow when everything is on DigitalOcean for backend
  • Move Nuxt3 to DigitalOcean from Cloudflare
  • Start up Transcriber on MainPC
  • Refactor the transcriber to support concurrency.
  • Inserter needs to be part of transcriber
  • Make youtube video work with subs and sync too

Contributors

igormomc, lx358hcl
