
PostHog self-hosted migration tools (Alpha)

TL;DR

We provide bulk export functionality to allow users to migrate from self-hosted to Cloud deployments. Note that this process can be slow and unreliable.

Git clone this repo and run something like:

pip3 install -r requirements.txt
python3 ./migrate.py \
   --clickhouse-url https://some.clickhouse.cluster:8443 \
   --clickhouse-user default \
   --clickhouse-password some-password \
   --clickhouse-database posthog \
   --team-id 1234 \
   --posthog-url https://app.posthog.com \
   --posthog-api-token "abx123" \
   --start-date 2023-06-18T13:00:00Z \
   --end-date 2023-06-18T13:10:00Z \
   --fetch-limit 10000

The script prints out a "cursor" that, in case the migration fails, can be used to resume from where it got to previously. That would look like:

python3 ./migrate.py \
   --clickhouse-url https://some.clickhouse.cluster:8443 \
   --clickhouse-user default \
   --clickhouse-password some-password \
   --clickhouse-database posthog \
   --team-id 1234 \
   --posthog-url https://app.posthog.com \
   --posthog-api-token "abx123" \
   --start-date 2023-06-18T13:00:00Z \
   --end-date 2023-06-18T13:10:00Z \
   --fetch-limit 10000 \
   --cursor the-cursor-value-from-the-output

Run python3 ./migrate.py --help to get a complete list of options.

What it does

To aid in getting people moved over, this tool:

  1. reads event data directly from ClickHouse.
  2. uses the PostHog Python library to ingest the data into PostHog cloud.
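
For illustration, those two steps amount to something like the sketch below, assuming the clickhouse-driver package for step 1 and the posthog package for step 2. The real migrate.py adds cursoring, batching and time-ordering guarantees, and its exact query differs.

import json

import posthog
from clickhouse_driver import Client

# Values mirror the CLI flags from the TL;DR example above.
posthog.project_api_key = "abx123"        # --posthog-api-token
posthog.host = "https://app.posthog.com"  # --posthog-url

client = Client(
    host="some.clickhouse.cluster",
    user="default",
    password="some-password",
    database="posthog",
)

rows = client.execute(
    """
    SELECT uuid, event, distinct_id, properties, timestamp
    FROM events
    WHERE team_id = %(team_id)s
      AND timestamp >= %(start)s AND timestamp < %(end)s
    ORDER BY timestamp, uuid
    """,
    {"team_id": 1234, "start": "2023-06-18 13:00:00", "end": "2023-06-18 13:10:00"},
)

# uuid is selected here because migrate.py uses it (with timestamp) as its cursor.
for uuid, event, distinct_id, properties, timestamp in rows:
    posthog.capture(
        distinct_id=distinct_id,
        event=event,
        properties=json.loads(properties),  # properties are stored as a JSON string
        timestamp=timestamp,                # preserve the original event time
    )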

Why pull directly from ClickHouse? Mainly, it reduces the requirement from needing a working PostHog installation down to just needing ClickHouse to be responsive. It also helps with performance.

NOTE: this script will add a $lib = posthog-python property, overriding anything else that was already there.

Caveats

Memory usage isn't optimal

The challenge is ingesting in time order efficiently. The sort key for the events table in ClickHouse truncates the event timestamp to a day, so we cannot order efficiently by timestamp. The implications for memory usage of ordering by timestamp instead of toDate(timestamp) are unclear. It may be that, while streaming through, ClickHouse only needs to order the events within a single day at any one time, in which case memory scales with the size of a day rather than with the whole date range; although if you have lots of data within one day, it could still be an issue. If it turns out it doesn't scale well, you can always run this multiple times with smaller, incrementing day ranges.

To avoid migrations scaling too badly, we end up doing multiple queries using LIMIT to keep the memory usage down. Assuming there are not too many rows in a single day for which the (timestamp, uuid) cursor needs to be calculated, this should be fine.
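
Concretely, a keyset-paged fetch with a (timestamp, uuid) cursor might look like the sketch below. This is illustrative, not migrate.py's actual query; it assumes the clickhouse-driver client from the earlier sketch.

def fetch_page(client, team_id, start, end, fetch_limit, cursor=None):
    # Resume strictly after the last seen (timestamp, uuid) pair instead of
    # using OFFSET, so ClickHouse never re-reads rows already fetched.
    where = "team_id = %(team_id)s AND timestamp >= %(start)s AND timestamp < %(end)s"
    params = {"team_id": team_id, "start": start, "end": end, "limit": fetch_limit}
    if cursor is not None:
        where += " AND (timestamp, uuid) > (%(cursor_ts)s, toUUID(%(cursor_uuid)s))"
        params["cursor_ts"], params["cursor_uuid"] = cursor
    return client.execute(
        f"""
        SELECT uuid, event, distinct_id, properties, timestamp
        FROM events
        WHERE {where}
        ORDER BY timestamp, uuid
        LIMIT %(limit)s
        """,
        params,
    )

Each page still pays for the sort, but only LIMIT rows are held and returned, which is what keeps memory bounded.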

Note that increasing --fetch-limit may reduce the load on the cluster due to the reduced duplication of query execution, assuming you've got enough memory.

Performance isn't as good as it should be

Due to the above limitation around time ordering, we need to perform some expensive queries that do not support streaming, so the script is not as performant as I would like. It's possible that we could either:

  1. write the data to a new table with a refined sort key (see the sketch after this list), although this risks e.g. disk space issues.
  2. offer a way to drop the time-ordering requirement; perhaps some users do not need to worry about this.
  3. identify the subset of events that needs to be in time order, ingest those first, then do the rest. This would, however, break point-in-time person properties, which are denormalized onto events at ingestion time.
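
For example, option 1 could be roughly the statement below, assuming you have the disk headroom for a second copy of the events table (the table name is illustrative):

# Illustrative only: materialise a copy of events with a sort key that lets a
# time-ordered scan stream instead of re-sorting each page.
client.execute("""
    CREATE TABLE events_time_ordered
    ENGINE = MergeTree()
    ORDER BY (team_id, timestamp, uuid)
    AS SELECT * FROM events
""")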


Issues

Migration ends with a session not closed error

Expected behaviour:

Migration ends without any error thrown if successful.

Current behaviour:

Throws an error at the end even if successful:

ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7fc991b3d760>
ERROR:asyncio:Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7fc991022d00>, 467526.428350588)]']
connector: <aiohttp.connector.TCPConnector object at 0x7fc991b3d880>
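
The warning is raised at interpreter shutdown, and since it appears even when the migration succeeded, it is likely cosmetic. The usual fix for this class of aiohttp warning is to make sure the ClientSession is closed before the event loop exits, e.g.:

import aiohttp

async def run_migration():
    # The context manager guarantees the ClientSession (and its TCPConnector)
    # is closed before the event loop shuts down, which silences the warning.
    async with aiohttp.ClientSession() as session:
        ...  # perform the HTTP calls here

# or, for a session created without a context manager:
#     await session.close()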

migrate.py script doesn't migrate any events

After following the manual and executing the migrate.py script, no events were migrated. The username, password and token in the example below have been altered for security reasons.

sudo python3 ./migrate.py --clickhouse-url http://IP:8123 --clickhouse-user user --clickhouse-password password --clickhouse-database posthog --team-id ID --posthog-url https://eu.posthog.com --posthog-api-token "token" --start-date 2021-03-15T13:00:00 --end-date 2023-08-25T13:00:00 --fetch-limit 10000
INFO:__main__:Migrating 0 events
Cursor: None
{
"committed_cursor": null,
"total_events": 0
}

According to the ClickHouse database, there are events in this time frame:
┌─uuid─────────────────────────────────┬─event──────────────────┬──────────────────timestamp─┬─────────────────created_at─┐
│ 01872255-ebac-0000-8c10-b2bf4f32dee9 │ JPO/logged_in │ 1970-01-01 00:00:10.530000 │ 2023-03-27 09:11:08.999000 │
│ 0186ffeb-1398-0000-b3ba-ba5e848b8acf │ WEB/store/cart_entered │ 1970-01-01 00:00:14.237000 │ 2023-03-20 16:47:21.521000 │
│ 01783444-98c0-0000-d306-9777957acbc8 │ $pageview │ 2021-03-15 05:03:30.473000 │ 2022-10-14 11:09:46.243000 │
│ 01783444-971f-0000-0d96-2e91515769cb │ $autocapture │ 2021-03-15 05:03:30.656000 │ 2022-10-14 11:10:47.609000 │
│ 01783444-98d6-0000-d148-f39117b3081a │ JMP/visited_jdba_page │ 2021-03-15 05:03:31.035000 │ 2022-10-14 11:09:46.284000 │
│ 01783444-972b-0000-fa6e-ecbba65bbe8b │ $autocapture │ 2021-03-15 05:03:32.572000 │ 2022-10-14 11:10:47.624000 │
│ 01783444-9740-0000-3f43-28e28500ab89 │ $autocapture │ 2021-03-15 05:03:32.575000 │ 2022-10-14 11:10:47.638000 │
│ 01783444-a2fd-0000-bae6-90aecae55982 │ $pageleave │ 2021-03-15 05:03:36.682000 │ 2022-10-14 11:10:47.658000 │
│ 01783444-ae7b-0000-1f68-76b6aa5691a8 │ $autocapture │ 2021-03-15 05:03:39.169000 │ 2022-10-14 11:10:47.672000 │
│ 01783444-dd84-0000-f3d4-debaaef6ace0 │ $autocapture │ 2021-03-15 05:03:50.110000 │ 2022-10-14 11:10:47.688000 │
└──────────────────────────────────────┴────────────────────────┴────────────────────────────┴────────────────────────────┘
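
Note, though, that every sample row above falls outside the requested window: the first two have 1970-01-01 timestamps, and the rest are at 05:03 on 2021-03-15, before the 13:00:00 start time. A quick way to check what the script will actually match is to run its filter directly against ClickHouse; a sketch using clickhouse-driver, assuming the script filters on the event timestamp column (the team id below is a placeholder, since the report redacts it):

# Count the events the migration window should match. If this returns 0, the
# --team-id or the --start-date/--end-date range (compared against the event
# timestamp, not created_at) is the problem.
count = client.execute(
    """
    SELECT count()
    FROM events
    WHERE team_id = %(team_id)s
      AND timestamp >= %(start)s AND timestamp < %(end)s
    """,
    {"team_id": 1234, "start": "2021-03-15 13:00:00", "end": "2023-08-25 13:00:00"},
)[0][0]
print(count)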

Feature suggestion: Skip events on migration

Sometimes when you migrate from your test instance you know that certain events you created will never be needed on the new system.

So during our migration we worked with a fixed filter list in the SQL query (AND event NOT IN ('event_name_a', 'event_name_b', 'event_name_c')). It might be more usable for others if they could pass these as parameters.
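
A hypothetical --exclude-events flag implementing this suggestion might look like the sketch below (the flag name and query shape are illustrative, not part of migrate.py today):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--exclude-events",
    nargs="*",
    default=[],
    help="Event names to skip during the migration",
)
args = parser.parse_args()

query = "SELECT uuid, event, distinct_id, properties, timestamp FROM events WHERE team_id = %(team_id)s"
params = {"team_id": 1234}
if args.exclude_events:
    # clickhouse-driver expands a list/tuple parameter inside an IN clause
    query += " AND event NOT IN %(excluded)s"
    params["excluded"] = tuple(args.exclude_events)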
