dcaribou / transfermarkt-datasets

⚽️ Extract, prepare and publish Transfermarkt datasets.

Home Page: https://www.kaggle.com/datasets/davidcariboo/player-scores

License: Creative Commons Zero v1.0 Universal

Jupyter Notebook 43.18% Python 48.32% Shell 1.14% HCL 3.94% Dockerfile 0.49% Makefile 2.34% Mermaid 0.60%
soccer-analytics football-data football analytics dataset dbt

transfermarkt-datasets's Introduction


transfermarkt-datasets

In a nutshell, this project aims for three things:

  1. Acquiring data from the transfermarkt website using the transfermarkt-scraper.
  2. Building a clean, public football (soccer) dataset using data in 1.
  3. Automating 1 and 2 to keep assets up to date and publicly available on some well-known data catalogs.

Open in GitHub Codespaces Kaggle data.world


classDiagram
direction LR
competitions --|> games : competition_id
competitions --|> clubs : domestic_competition_id
clubs --|> players : current_club_id
clubs --|> club_games : opponent/club_id
clubs --|> game_events : club_id
players --|> appearances : player_id
players --|> game_events : player_id
players --|> player_valuations : player_id
games --|> appearances : game_id
games --|> game_events : game_id
games --|> clubs : home/away_club_id
games --|> club_games : game_id
class competitions {
 competition_id
}
class games {
    game_id
    home/away_club_id
    competition_id
}
class game_events {
    game_id
    player_id
}
class clubs {
    club_id
    domestic_competition_id
}
class club_games {
    club_id
    opponent_club_id
    game_id
}
class players {
    player_id
    current_club_id
}
class player_valuations{
    player_id
}
class appearances {
    appearance_id
    player_id
    game_id
}


📥 setup

🔈 New! → Thanks to GitHub Codespaces you can now spin up a working dev environment in your browser with just a click, no local setup required.

Open in GitHub Codespaces

Set up your local environment to run the project with poetry.

  1. Install poetry
  2. Install python dependencies (poetry will create a virtual environment for you)
cd transfermarkt-datasets
poetry install

Remember to activate the virtual environment once poetry has finished installing the dependencies by running poetry shell.

make

The Makefile in the root defines a set of useful targets that will help you run the different parts of the project. Some examples are

dvc_pull                       pull data from the cloud
docker_build                   build the project docker image and tag accordingly
acquire_local                  run the acquiring process locally (refreshes data/raw/<acquirer>)
prepare_local                  run the prep process locally (refreshes data/prep)
sync                           run the sync process (refreshes data frontends)
streamlit_local                run streamlit app locally

Run make help to see the full list. Once you've completed the setup, you should be able to run most of these from your machine.

💾 data storage

All project data assets are kept inside the data folder. This is a DVC repository, so all files can be pulled from remote storage by running dvc pull.

path description
data/raw contains raw data for different acquirers (check the data acquisition section below)
data/prep contains prepared datasets as produced by dbt (check data preparation)
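
If you prefer to stay in Python, DVC also exposes a small Python API that can read individual tracked files without a full dvc pull. This is a minimal sketch, assuming the dvc package is installed and the remote is readable from your environment; the asset path data/prep/games.csv is illustrative.

# read one prepared file through DVC's Python API (sketch; asset path is illustrative)
import io
import dvc.api
import pandas as pd

csv_text = dvc.api.read(
    "data/prep/games.csv",
    repo="https://github.com/dcaribou/transfermarkt-datasets",
)
games = pd.read_csv(io.StringIO(csv_text))
print(games.shape)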

🕸️ data acquisition

In the scope of this project, "acquiring" is the process of collecting data from a specific source and via an acquiring script. Acquired data lives in the data/raw folder.

acquirers

An acquirer is just a script that collects data from somewhere and puts it in data/raw. They are defined in the scripts/acquiring folder and run using the acquire_local make target. For example, to run the transfermarkt-api acquirer with a set of parameters, you can run

make acquire_local ACQUIRER=transfermarkt-api ARGS="--season 2023"

which will populate data/raw/transfermarkt-api with the data it collected. Obviously, you can also run the script directly if you prefer.

cd scripts/acquiring && python transfermarkt-api.py --season 2023
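
For illustration only, a hypothetical acquirer could look roughly like the sketch below: parse a --season argument, fetch data from some source and write the result under data/raw/<acquirer>. The endpoint, file names and acquirer name here are made up; see scripts/acquiring for the real implementations.

# hypothetical acquirer sketch (endpoint, names and output layout are illustrative only)
import argparse
import json
import pathlib
import requests

parser = argparse.ArgumentParser()
parser.add_argument("--season", required=True)
args = parser.parse_args()

raw_dir = pathlib.Path("data/raw/my-acquirer")
raw_dir.mkdir(parents=True, exist_ok=True)

# fetch data for the requested season and drop it as JSON in the raw folder
response = requests.get("https://example.com/api/games", params={"season": args.season})
response.raise_for_status()

with open(raw_dir / f"games_{args.season}.json", "w") as f:
    json.dump(response.json(), f)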

🔨 data preparation

In the scope of this project, "preparing" is the process of transforming raw data to create a high quality dataset that can be conveniently consumed by analysts of all kinds.

Data preparation is done in SQL using dbt and DuckDB. You can trigger a run of the preparation task using the prepare_local make target or work with the dbt CLI directly if you prefer.

  • cd dbt → The dbt folder contains the dbt project for data preparation
  • dbt deps → Install dbt packages. This is only required the first time you run dbt.
  • dbt run -m +appearances → Refresh the assets by running the corresponding model in dbt.

dbt runs will populate a dbt/duck.db file in your local environment, which you can connect to using the DuckDB CLI and query using SQL.

duckdb dbt/duck.db -c 'select * from dev.games'


⚠️ Make sure that you are using a DuckDB version that matches the one used in the project.
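
If you prefer Python over the DuckDB CLI, the same query can be run through the duckdb package. A minimal sketch; the dev.games relation name is taken from the CLI example above.

# query the local dbt-produced database from Python (sketch)
import duckdb

con = duckdb.connect("dbt/duck.db")
games = con.execute("select * from dev.games").df()  # fetch the result as a pandas DataFrame
print(games.head())
con.close()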

python api

A thin python wrapper is provided as a convenience utility to help with loading and inspecting the dataset (for example, from a notebook).

# import the module
from transfermarkt_datasets.core.dataset import Dataset

# instantiate the datasets handler
td = Dataset()

# load all assets into memory as pandas dataframes
td.load_assets()

# inspect assets
td.asset_names # ["games", "players", ...]
td.assets["games"].prep_df # get the built asset in a dataframe

# get raw data in a dataframe
td.assets["games"].load_raw()
td.assets["games"].raw_df 

The module code lives in the transfermarkt_datasets folder with the structure below.

path description
transfermarkt_datasets/core core classes and utils that are used to work with the dataset
transfermarkt_datasets/tests unit tests for core classes
transfermarkt_datasets/assets prepared asset definitions: one python file per asset

For more examples on using transfermarkt_datasets, check out the sample notebooks.

👁️ frontends

Prepared data is published to a couple of popular dataset websites. This is done by running make sync, which runs weekly as part of the data pipeline.

🎈 streamlit

There is a streamlit app for the project with documentation, a data catalog and sample analysis. The app used to be hosted on fly.io, but deployment is currently disabled until this is resolved.

For local development, you can also run the app on your machine. Provided you've done the setup, run the following to spin up a local instance of the app

make streamlit_local

⚠️ Note that the app expects prepared data to exist in data/prep. Check out data storage for instructions about how to populate that folder.

🏗️ infra

Define all the necessary infrastructure for the project in the cloud with Terraform.

🎼 orchestration

The data pipeline is orchestrated as a series of Github Actions workflows. They are defined in the .github/workflows folder and are triggered by different events.

workflow name triggers on description
build* Every push to the master branch or to an open pull request It runs the data preparation step, and tests and commits a new version of the prepared data if there are any changes
acquire-<acquirer>.yml Schedule It runs the acquirer and commits the acquired data to the corresponding raw location
sync-<frontend>.yml Every change on prepared data It syncs the prepared data to the corresponding frontend

*build-contribution is the same as build but without committing any data.

💡 Debugging workflows remotely is a pain. I recommend using act to run them locally to the extent possible.

💬 community

📞 getting in touch

In order to keep things tidy, there are two simple guidelines

  • Keep the conversation centralised and public by getting in touch via the Discussions tab.
  • Avoid topic duplication by having a quick look at the FAQs

🫶 sponsoring

Maintenance of this project is made possible by sponsors. If you'd like to sponsor this project you can use the Sponsor button at the top.

→ I would like to express my gratitude to @mortgad for becoming the first sponsor of this project.

👨‍💻 contributing

Contributions to transfermarkt-datasets are most welcome. If you want to contribute new fields or assets to this dataset, the instructions are quite simple:

  1. Fork the repo
  2. Set up your local environment
  3. Populate data/raw directory
  4. Start modifying assets or creating new ones in the dbt project
  5. If it's all looking good, create a pull request with your changes 🚀

ℹ️ In case you face any issue following the instructions above, please get in touch

transfermarkt-datasets's People

Contributors

adam-cowley, blcklvls, dcaribou, dcereijodo, github-actions[bot], larchliu, srdjov18


transfermarkt-datasets's Issues

Create a Kaggle dataset

Create a Kaggle dataset with a 'prep' appearances file and set up automatic weekly updates for that file

Initial version will be for ES1 (Spanish league) only

cloning fails due to invalid path

Hi, I tried to clone the repository with git clone --recursive https://github.com/dcaribou/transfermarkt-datasets.git, but I get the following message:

Cloning into 'transfermarkt-datasets'...
remote: Enumerating objects: 1467, done.
remote: Counting objects: 100% (605/605), done.
remote: Compressing objects: 100% (303/303), done.
remote: Total 1467 (delta 354), reused 461 (delta 288), pack-reused 862
Receiving objects: 100% (1467/1467), 2.94 MiB | 10.73 MiB/s, done.
Resolving deltas: 100% (697/697), done.
error: invalid path 'streamlit/pages/03_📈_analysis:_player_value.py'
fatal: unable to checkout working tree
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

Even though it says the clone succeeded, the transfermarkt-datasets folder is actually empty apart from the .git folder.

Any idea what could be the issue here? Any help would be much appreciated!

Add games and appearances for non-domestic competitions

We are currently supporting a subset of all the leagues and competitions available in Transfermarkt (see below). We should extend the dataset with additional leagues and competitions.

Depends on dcaribou/transfermarkt-scraper#17

Region League Code Country Availability First Season
Europe ES1 Spain current
Europe GB1 England current
Europe L1 Germany current
Europe NL1 Netherlands current
Europe DK1 Denmark current
Europe TR1 Turkey current
Europe GR1 Greece current
Europe IT1 Italy current
Europe BE1 Belgium current
Europe PO1 Portugal current
Europe FR1 France current
Europe RU1 Russia current
Europe UKR1 Ukraine current
Europe SC1 Scotland current
America All All -
Asia All All -
Africa All All -

First name // Last name in different columns

Moved from dcaribou/transfermarkt-scraper#47

Hi again,

I've noticed that Transfermarkt formats first names and last names differently in their pages: last names are given the "strong" tag, whereas first names are not.

It would be great to gather these, for a number of reasons:

It's tidier to separate out the two variables (first name, last name) into different columns
as the examples above illustrate, it's not always clear where the first name(s) end and the last name(s) start
names sometimes contain dashes, which seem to be missed by the current approach
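
As a rough illustration of the idea, the split could be derived from the markup described above. This sketch assumes the player name header wraps the last name in a strong tag; the HTML snippet is made up, and this is not the scraper's actual parsing code.

# sketch: split first and last name using the strong tag described above
from bs4 import BeautifulSoup

html = "<h1>Sergej <strong>Milinković-Savić</strong></h1>"  # illustrative markup
header = BeautifulSoup(html, "html.parser").h1
last_name = header.strong.get_text(strip=True)
first_name = header.get_text(strip=True).replace(last_name, "").strip()
print(first_name, "|", last_name)  # Sergej | Milinković-Savić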


Decode HTML special characters

Some entity names contain special characters, which are encoded in the HTML as escape sequences. These sequences are part of the URLs and, if they are not decoded, they end up as part of the entity attributes.

For example, special characters in URL https://www.transfermarkt.co.uk/sergej-milinkovi%C4%87-savi%C4%87/profil/spieler/266302 end up in the name of the player in the players.csv file

266302,sergej-milinkovi%C4%87-savi%C4%87,398,midfield - Central Midfield,https://www.transfermarkt.co.uk/sergej-milinkovi%C4%87-savi%C4%87/profil/spieler/266302

We should decode these special sequences to include the correct characters in the file.
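
Decoding the percent-encoded sequences is straightforward with the Python standard library, for example:

# decode percent-encoded characters such as %C4%87 back into the original text
from urllib.parse import unquote

encoded = "sergej-milinkovi%C4%87-savi%C4%87"
print(unquote(encoded))  # sergej-milinković-savić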

Explore alternatives to Heroku for streamlit app hosting

Heroku is discontinuing the free plan by the end of November 2022. The next most economic plan, which seems to be Eco, starts at $5 per month, which is not ideal for a low-traffic application such as our streamlit app.

Explore alternatives to Heroku that can adapt better to the needs of the app.

Name Docker support* Elastic pricing** AWS Ireland Custom domains Test successful
Heroku Eco Yes No Yes Yes
Heroku Basic Yes No Yes Yes
Streamlit Cloud No ❌
Cyclic No ❌
Deta No ❌
Fly.io 🏆 Yes Yes No Yes Yes
Render Yes Yes No Yes No
Railway Yes Yes us-west-1 only - No
Porter
Clever Cloud
PythonAnywhere

*Docker support means we have the ability to create a custom Docker image for the app and define a launch command
**Elastic pricing means either free hosting or a cheap pay-as-you-go option with scale down to zero

Update dataset ER diagram

The ER diagram that is currently displayed in the README is missing the player valuations asset.
As more assets are added (and even for correctly displaying the relationships for the existing assets) it will get harder to manually keep the diagram up to date.

Is there a way to generate ER diagrams dynamically?

Additional appearance metrics - passes, shots, interceptions

This request was raised here by @batamayo59.

It would be interesting to add some additional fields to the players data set like:

  • number of passes total and completed
  • shots on target / off target
  • number of interceptions

With this we could do a broader analysis of the factors that have helped teams win matches. We could also compute scores by player and measure the number of opportunities per match; with the information currently available we can only compare goals scored vs goals allowed.

Setup public access to DVC assets

In order for everyone to be able to access DVC assets and appearances snapshots, allow public access on the bucket somehow.

A few things to consider

AWS Budget actions do not support triggering a change in the ACL of an S3 bucket (the way to change the status of a bucket from public to private). Explore the option of setting up a simple lambda that subscribes to AWS Budget events to do this. Useful resources:

  • Setting up a lambda with Terraform (link)
  • Setting up a subscription from SNS to a lambda with Terraform (link)
  • Lambda deployment packages (link)
  • Lambda development tools (link)
  • Terraform example (link)

Fix incorrect columns

Card columns (yellow_cards and red_cards) are obviously incorrect.


Goals and assists seem to be equal


Add freshness checks

Avoid regressions such as this one by implementing recency / freshness checks for assets.

There does not seem to be a built-in check for recency in frictionless, hence this probably means adding a new custom check in transfermarkt_datasets/core/checks.py
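
As a placeholder for what such a check could assert, here is a minimal pandas sketch. It is not wired into the frictionless API, and the date column name and threshold are assumptions.

# minimal recency check sketch (column name and threshold are assumptions)
from datetime import datetime, timedelta
import pandas as pd

def check_freshness(df: pd.DataFrame, date_column: str = "date", max_age_days: int = 14) -> None:
    latest = pd.to_datetime(df[date_column]).max()
    age = datetime.now() - latest.to_pydatetime()
    if age > timedelta(days=max_age_days):
        raise ValueError(f"Asset looks stale: latest {date_column} is {latest.date()}, {age.days} days old")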

Missing current/max market value in players asset

Reported here.

What I've noticed so far:

  • Raw data contains duplicated rows for many players (e.g. player id 181177)
  • One of these duplicates contains the expected market value and the other one does not
  • The deduplication logic appears to be taking the records with the missing values

Most likely this problem originated after updating the acquiring step to collect historical market values. The historical market value (player_valuations asset) is working as expected, but the max/current values in the players asset got broken.
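
One possible fix, sketched with pandas below: sort so that rows carrying a market value come first, then deduplicate on player_id keeping the first row. The column names are assumptions for illustration.

# deduplication sketch that keeps, per player_id, the row carrying a market value
import pandas as pd

def dedupe_players(raw_players: pd.DataFrame) -> pd.DataFrame:
    return (
        raw_players
        .sort_values("market_value_in_eur", na_position="last")  # assumed column name
        .drop_duplicates(subset="player_id", keep="first")
    )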

Enable self-serve DVC backend access granting

In order to be able to pull data from DVC remote storage, potential contributors need to be able to List and Get objects from the S3 bucket s3://player-scores. Enable users to self-grant permissions for these actions by creating a PR adding their IAM user to a whitelist in the Terraform configuration.

Create a terraform whitelist that renders an aws_iam_policy_document that can be attached to the bucket
https://github.com/dcaribou/player-scores/blob/49385db77088718fdf5344d4f8b1cf7b20da8f19/infra/base/main.tf#L14

Extend dataset with historical data

This issue is for tracking the addition of historical data to the datasets. The acquiring and processing scripts have been extended to handle multiple seasons by means of PRs

With that, it should be possible to add historical data by running the scripts with the necessary arguments.

Add `player_agent` to the `players` file

This attribute is already being scraped, and it can easily be added to the dataset.

https://github.com/dcaribou/transfermarkt-scraper/blob/ba1f2e61ae26dd9e492433b51d955b7774919d4d/tfmkt/spiders/players.py#L70

Sample raw data:

{"player_agent": {"href": "/spocs-global-sports/beraterfirma/berater/728", "name": "SPOCS Global Sports"}}
{"player_agent": {"href": "/alessandro-lucci-wsa/beraterfirma/berater/529", "name": "Alessandro Lucci - WSA"}}
{"player_agent": {"href": "/key-united/beraterfirma/berater/2654", "name": "Key United"}}

Discussion here

Show validation error messages

Context

A FailedAssetValidation is raised whenever a validation error happens for an asset during preparation. This exception, though, does not currently display the error message, which means that in order to understand what the problem is we need to look at the validation report files that are produced in transfermarkt_datasets/

Solution

The FailedAssetValidation exception could contain an error message that summarises the validation report, which would allow understanding what the error is directly from the logs.
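
One way this could look, as a sketch (the list of errors passed in here is an assumption, not the actual frictionless report format):

# sketch: surface a short summary of the validation errors in the exception message
class FailedAssetValidation(Exception):
    def __init__(self, asset_name: str, errors: list):
        summary = "; ".join(str(e) for e in errors[:5])  # first few errors only
        super().__init__(f"Validation failed for asset '{asset_name}': {summary}")
        self.errors = errors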

Consider validation results in CI workflows

After #16, the results of the validations are merely informative and are not being considered when syncing the dataset. We should take these results into account, at least to avoid breaking changes.

The system cannot find the file specified

--> Acquiring games
Traceback (most recent call last):
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\api\client.py", line 214, in _retrieve_server_version
return self.version(api_version=False)["ApiVersion"]
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\api\daemon.py", line 181, in version
return self._result(self._get(url), json=True)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\utils\decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\api\client.py", line 237, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 555, in get
return self.request('GET', url, **kwargs)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\adapters.py", line 439, in send
resp = conn.urlopen(
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen
httplib_response = self._make_request(
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1279, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1325, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1274, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1034, in _send_output
self.send(msg)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 974, in send
self.connect()
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\transport\npipeconn.py", line 29, in connect
sock.connect(self.npipe_path)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\transport\npipesocket.py", line 22, in wrapped
return f(self, *args, **kwargs)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\transport\npipesocket.py", line 71, in connect
raise e
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\transport\npipesocket.py", line 51, in connect
handle = win32file.CreateFile(
pywintypes.error: (2, 'CreateFile', 'The system cannot find the file specified.')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Mehdi\Downloads\transfermarkt-datasets-master\transfermarkt-datasets-master\1_acquire.py", line 155, in
acquired_data = acquire_asset(
File "C:\Users\Mehdi\Downloads\transfermarkt-datasets-master\transfermarkt-datasets-master\1_acquire.py", line 106, in acquire_asset
docker_client = docker.from_env()
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\client.py", line 96, in from_env
return cls(
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\client.py", line 45, in init
self.api = APIClient(*args, **kwargs)
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\api\client.py", line 197, in init
self._version = self._retrieve_server_version()
File "C:\Users\Mehdi\AppData\Local\Programs\Python\Python39\lib\site-packages\docker\api\client.py", line 221, in _retrieve_server_version
raise DockerException(
docker.errors.DockerException: Error while fetching server API version: (2, 'CreateFile', 'The system cannot find the file specified.')

Missing appearances for some players in historical seasons

Reported in a data.world Discussion.

Hello, the "appearances" data is not complete. The player with the id "5958" has only 58 appearances in the competition with id "IT1". The correct result is 617.

Using the query

with appearances as (
    SELECT
      games.date,
      games.season,
      games.competition_code,
      appearances.appearance_id,
      players.url
   FROM appearances
   LEFT JOIN players USING(player_id)
   LEFT JOIN games USING(game_id)
   WHERE player_id = 5958
)
SELECT
   season,
   COUNT(*)
FROM appearances
GROUP by season

We get the appearances count for each season

2015 15
2014 36
2016 28

Although the data for seasons 2014 to 2016 is correct, seasons 2012 and 2013 seem to be missing, since this player has appearances at least for season 2013.

Manage infrastructure with Terraform

Including

  • Project scoped users
  • S3 bucket for the project data assets, such as DVC remote storage and weekly execution results

Maybe

  • ECS tasks for the weekly execution

Switch scraper season to 2021

After pinning the scraper season to 2020, we need to change the pipeline to switch to season 2021 consistently. Things to take into account:

  • We need to change the scraper season to 2021
  • We need to start saving snapshots in S3 in a separate prefix
  • We need to update the fetch step to collect data so that multiple seasons are considered
  • We need to update the prep step so that multiple seasons are considered

Questions

  • Assuming we can parameterize a scraper run for a particular season, we will have disconnected datasets for clubs, players, games and appearances for each season. (done in dcaribou/transfermarkt-scraper#19)
    • How to merge them?
    • Where to merge them? Scraper? Player Scores?

Deal with player transferences during a season

Context

Players that are transferred in the middle of the season might have appearances for two different clubs (even within the same league). When this happens, we are not able to correctly identify which club the player was playing for at the time of the appearance.

Example

For example, geoffrey-kondogbia moved to atletico-madrid after playing some games for fc-valencia, which caused early-season appearances to be tagged as if he played them for atletico-madrid (his current team). They should be fc-valencia, as this is the team he was playing for at the time of the appearance.

How to fix

Fixing this required a change in the scraper to parse, into a new column, which club a player was playing for. The problem has been addressed and solved in this PR. Now we need to adapt player-scores to work with this new version of the scraper.

How to validate

In order to find the faulty rows, show appearances for teams that fail validation

print(games_per_season_per_club)
rows = df[(df['season'] == 2018) & (df['club_domestic_competition'] == 'ES1') & (df['competition'] == 'ES1') & (df['player_club_name'] == 'athletic-bilbao')]
pandas.set_option('display.max_rows', rows.shape[0]+1)
print(rows[['date', 'round', 'home_club_name', 'away_club_name', 'game_id', 'player_club_name', 'player_name']].sort_values(by='date'))

Rename repository as `transfermarkt-datasets`

Rename the repo as transfermarkt-datasets. This should make it easier for potential contributors to understand the purpose of the repository: hosting cleaned datasets scraped from the Transfermarkt website.

Things to take into account

  • Update all references to player-scores
    • USER_AGENT
    • Terraform resources
      • Update references
      • Renaming an S3 bucket requires manual operations as described here
    • environment.yml
    • dvc remote storage
    • Github action workflows
    • Playbooks
    • datapackage_description.md
    • README.md (test new references!)
  • Improve documentation
    • Remove initial section with the three phases of the broader scope project
    • Change title in README
    • Change repo description
  • Can we create a diagram such as this one? - http://openfootball.github.io/

Move "prep" pipeline to dagster

Goal: Enable arbitrarily complex data modelling on the dataset

Going forward, we want to enable more complex data modelling that supports use cases such as #85. Moving the preparation steps to a Dagster DAG will help with this.

Move validations to frictionless

Currently there are a bunch of custom validations defined in the asset runners that could be replaced by built-in validators from the frictionless framework. The framework supports custom validations as well, which we could explore in order to completely move validations to frictionless.

Asset attributes incorrect/outdated

Reported on https://data.world/dcereijo/player-scores/discuss/clubscsv-old-data/k2erbnfs

It looks like some dataset asset attributes are outdated. For example, the clubs asset is showing Pep Guardiola as the FC Bayern manager.
Looking at the raw data for the season, the FC Bayern manager we collect is Julian Nagelsmann, hence my first impression is that the issue lies in the asset grouping across multiple seasons

self.prep_df = pandas.concat(self.prep_dfs, axis=0).drop_duplicates(subset='club_id')
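
A possible fix, sketched below under the assumption that each per-season dataframe carries a season column: sort by season descending so that the most recent record wins the deduplication.

# sketch of a fix: keep the most recent season's record for each club (assumes a 'season' column)
self.prep_df = (
    pandas.concat(self.prep_dfs, axis=0)
    .sort_values("season", ascending=False)
    .drop_duplicates(subset="club_id", keep="first")
)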

add market value history for players

Hi, currently the current market value column in the players dataset is mostly empty.
Could we have an updated value there and, if possible, the history of values in another table, maybe called values?
Thank you for sharing all this!

Explore custom app domain registration

With the recent changes affecting Heroku plans, the risk of a potential service hosting migration in the future has arisen (#113).
If that happens, the current URL for the app (transfermarkt-datasets.herokuapp.com) would need to change.

In order to isolate the app URL from a (potential) future change in the hosting provider, one strategy could be to acquire a domain and have a DNS service direct the traffic to the app regardless of the hosting service. In this issue we explore the available options and costs of doing this.

https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Welcome.html
https://d32ze2gidvkk54.cloudfront.net/Amazon_Route_53_Domain_Registration_Pricing_20140731.pdf
