
cognite-sdk-python's Introduction


Cognite Python SDK


This is the Cognite Python SDK for developers and data scientists working with Cognite Data Fusion (CDF). The package is tightly integrated with pandas and helps you work easily and efficiently with data in CDF.

Reference documentation

Installation

Without any optional dependencies

To install the core version of this package:

$ pip install cognite-sdk

With optional dependencies

A number of optional dependencies may be specified in order to support a wider set of features. The available extras (along with the libraries they include) are:

  • numpy [numpy]
  • pandas [pandas]
  • geo [geopandas, shapely]
  • sympy [sympy]
  • functions [pip]
  • yaml [PyYAML]
  • all [numpy, pandas, geopandas, shapely, sympy, pip, PyYAML]

To include optional dependencies, specify them like this with pip:

$ pip install "cognite-sdk[pandas, geo]"

or like this if you are using poetry:

$ poetry add cognite-sdk -E pandas -E geo

Performance notes

If you regularly need to fetch large amounts of datapoints, consider installing with numpy (or with pandas, as it depends on numpy) for best performance, then use the retrieve_arrays (or retrieve_dataframe) endpoint(s). This avoids building large pure Python data structures, and instead reads data directly into memory-efficient numpy.ndarrays.
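For example, with a recent SDK version the call looks roughly like this; the retrieve_arrays namespace has moved between major versions (older versions expose it as client.datapoints.retrieve_arrays), and the external ID below is hypothetical:

from cognite.client import CogniteClient

client = CogniteClient()  # assumes credentials are configured in the environment

# Reads datapoints directly into numpy-backed DatapointsArray objects instead
# of building large pure-Python lists.
dps = client.time_series.data.retrieve_arrays(
    external_id="my-ts-external-id",  # hypothetical time series
    start="30d-ago",
    end="now",
)
print(dps.value[:10])  # the underlying numpy array of values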

Windows specific

On Windows, it is recommended to install geopandas and its dependencies using the conda package manager; see the geopandas installation page. The following commands create a new environment and install geopandas and the cognite-sdk:

conda create -n geo_env
conda activate geo_env
conda install --channel conda-forge geopandas
pip install cognite-sdk

Changelog

Wondering about upcoming or previous changes to the SDK? Take a look at the CHANGELOG.

Migration Guide

To help you upgrade your code(base) quickly and safely to a newer major version of the SDK, check out the MIGRATION GUIDE. It is a more focused guide based on the detailed CHANGELOG.

Contributing

Want to contribute? Check out CONTRIBUTING.

cognite-sdk-python's People

Contributors

andreavs, buggambit, ddonukis, dependabot[bot], doctrino, erlendvollset, greenbech, haakonvt, hmeiding, johanlrdahl, kennethskaar, mathialo, me-ydv-5, nimeshawij, olacognite, psalaberria002, qtiptip, quecognite, renovate[bot], sanderland, sighol, silvavelosa, sondreso, stianlagstad, tapped, tuanng-cognite, verstraetebert, vincent-cognite, vvemel, wjoel


cognite-sdk-python's Issues

numpy.int64 not accepted by sdk

It seems CDP doesn't know how to handle int64: when pushing data of type numpy.int64, I get an error. Changing it to float or int fixes the issue. Several others have encountered the same problem, since numpy.int64 is a very common type. It would be good to have this type supported by the python-sdk.


get_events ignores keyword arguments such as max_start_time when asset_id is given

Describe the bug
As a data scientist, I would like get_events to take in an asset_id together with additional arguments such as min_start_time, max_start_time, or type at the same time, because I need events to be filtered down to specific assets, time periods, and types.

I suspect the issue is caused by line 136 in events.py: if an asset_id is given, the keyword arguments are not taken into account.
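A hypothetical reconstruction of the suspected pattern (not the actual source), together with the obvious fix:

def _fetch_events(params):
    """Stand-in for the SDK's internal HTTP call."""
    ...

def get_events(asset_id=None, **kwargs):
    if asset_id is not None:
        # BUG: min_start_time, max_start_time, type, ... are silently dropped
        return _fetch_events({"assetId": asset_id})
    return _fetch_events(kwargs)

def get_events_fixed(asset_id=None, **kwargs):
    # Merge asset_id into the same parameter dict instead of branching.
    params = dict(kwargs)
    if asset_id is not None:
        params["assetId"] = asset_id
    return _fetch_events(params)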

To Reproduce

# set up
import os
from datetime import datetime

import matplotlib.pyplot as plt

import cognite

# connect to API
client = cognite.CogniteClient(api_key=os.environ['AKER_API'])

# take one random valve from SKARV
tag = '13-ESV-6001'
min_start = int(datetime(2019, 2, 28).timestamp() * 1000)
valves = client.assets.search_for_assets(name=tag).to_pandas()
valve_1 = valves.loc[valves['name'] == tag]
valve_1_id = valve_1['id'].loc[0]

# try to filter the valve's events down to those no older than February 28 2019
ce1 = client.events.get_events(
    asset_id=valve_1_id,
    type='Workorder',
    min_start_time=min_start,
    limit=1000).to_pandas()

# leave asset_id out
ce2 = client.events.get_events(
    type='Workorder',
    min_start_time=min_start,
    limit=1000).to_pandas()

# plot results
# clearly, ce1 was not filtered by date, while ce2 was
fig, ax = plt.subplots(1, 2)
ax[0].hist(ce1['startTime'].tolist())
ax[0].axvline(min_start, color='red')
ax[1].hist(ce2['startTime'].tolist())
ax[1].axvline(min_start, color='red')
plt.draw()

# also: ce1 contains many types, while ce2 only contains 'Workorder' as specified
print(ce1['type'].unique())
print(ce2['type'].unique())

Expected behavior
The ce1 DataFrame should not contain any records earlier than min_start and no other types than 'Workorder'.

Consider layering the SDK to decrease size

Pandas and numpy are heavy dependencies. We should consider splitting the SDK into a very thin layer that communicates with the API, and a thicker layer which includes utilities such as pandas.

DataTransferService creates new client by default

The DataTransferService creates its own client, and users are easily confused into connecting to two different projects simultaneously.

Suppose I have the following variables set:

export COGNITE_API_KEY=xyz789 # for default_project
export MY_PROJECT_API_KEY=abc123 # For my_project

Then I naively read the documentation and do this:

import os

from cognite import CogniteClient
from cognite.data_transfer_service import DataTransferService

client = CogniteClient(api_key=os.environ.get('MY_PROJECT_API_KEY'), project='my_project')
dts = DataTransferService()  # Will actually access 'default_project' with COGNITE_API_KEY!

I'm actually accessing two different projects with two different api keys, though this is not immediately apparent. To work around this, I have to always remember to explicitly pass api_key when creating a DataTransferService instance. That's a burden we should not put on our users.

Suggested fixes:

  1. If using COGNITE_API_KEY, issue a warning like: "Warning: Using the COGNITE_API_KEY environment variable to connect to the CDP project 'default_project'"
  2. Change the signature of DataTransferService to take client as an optional parameter instead of api_key, cookies, or num_of_workers. If no client is given, a new one is created (see the sketch below).
  3. Implement a getter client.data_transfer_service that returns DataTransferService(client=self).
  4. Extend the documentation for DataTransferService to explicitly mention that a new client is created if none is given, and suggest using cognite.data_transfer_service instead.
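A minimal sketch of fixes 1 and 2 combined (hypothetical signatures, not the current API):

import os
import warnings

from cognite import CogniteClient

class DataTransferService:
    def __init__(self, client=None):
        if client is None:
            # Make the implicit fallback visible instead of silent (fix 1).
            warnings.warn("No client given; using COGNITE_API_KEY to connect to the default project.")
            client = CogniteClient(api_key=os.environ.get("COGNITE_API_KEY"))
        self._client = client

A client.data_transfer_service getter (fix 3) would then simply return DataTransferService(client=self).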

DatapointsResponse.to_pandas() loses time series name information

In order to get the name, one currently has to first use DatapointsResponse.to_json()["name"]. This is inconvenient and also somewhat wasteful, particularly when one wants to get raw datapoints from several time series and subsequently merge them into a common data frame. The suggested fix is to either rename the pandas column to the name of the time series (works for API 0.5, but not good for API 0.6) or increase the amount of metadata available on the response object, e.g. add the time series name and id to the object by default.
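As a stopgap, the name can be recovered manually before merging. A sketch, assuming the datapoints column is called "value":

import pandas as pd

def to_named_frame(response):
    # response: a DatapointsResponse; label its column with the time series name
    name = response.to_json()["name"]
    return response.to_pandas().rename(columns={"value": name})

# frames = [to_named_frame(r) for r in responses]
# merged = pd.concat(frames, axis=1)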

Should handle pagination

When using get_assets, for example, the function should perhaps not only return the first 1000 elements, but automatically page through and return all assets.
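A generic sketch of the desired behaviour (a hypothetical helper, not the SDK's API):

def get_all(fetch_page):
    """fetch_page(cursor) -> (items, next_cursor); next_cursor is None on the last page."""
    items, cursor = [], None
    while True:
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if cursor is None:
            return items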

post_datapoints_frame is undocumented

post_datapoints_frame(dataframe) does not take a time series name as an argument, and there is no example of how to use the function. @trygvekk mentioned that the time series name is taken from the data frame header, but he did not have an example.
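A usage sketch based on the behaviour described above, assuming an authenticated client and that the DataFrame carries a 'timestamp' column in milliseconds since epoch (the convention used elsewhere in these issues), plus one column per time series:

import time
import pandas as pd

now_ms = int(time.time() * 1000)
df = pd.DataFrame({
    "timestamp": [now_ms + i * 1000 for i in range(10)],
    "my_timeseries_name": range(10),  # the column header doubles as the time series name
})
client.datapoints.post_datapoints_frame(df)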

Create standard methods for performing CRUD operations

These standard methods should be resource-type agnostic. As the API specs are consolidated for v1 of the API, these endpoints will be consistent enough to create generic methods for get, create, list, update, delete, search.
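A sketch of what such a layer could look like (hypothetical names, not the final design):

class APIClient:
    def __init__(self, http, base_url):
        self._http = http
        self._base_url = base_url

    def _retrieve(self, resource_path, id):
        return self._http.get(f"{self._base_url}{resource_path}/{id}")

    def _list(self, resource_path, **filters):
        return self._http.get(f"{self._base_url}{resource_path}", params=filters)

    def _create(self, resource_path, items):
        return self._http.post(f"{self._base_url}{resource_path}", json={"items": items})

    def _delete(self, resource_path, ids):
        return self._http.post(f"{self._base_url}{resource_path}/delete", json={"items": ids})

Concrete clients (assets, events, ...) would then only supply their resource_path plus any resource-specific extras such as search.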

get_multi_time_series_datapoints returns an iterator rather than a list

The documentation says:

Returns:
    list(stable.datapoints.DatapointsResponse): A list of data objects containing the requested data
    with several getter methods with different output formats.

But the actual return value is:

return DatapointsResponseIterator([DatapointsResponse(result) for result in results])

Add flag to include/exclude time series metadata in TimeSeriesResponse.to_pandas()

When getting multiple time series by ID, and they have very different metadata, it would be nice to choose whether or not all metadata fields should be separate columns in the resulting dataframe (when calling to_pandas()). I.e. something like to_pandas(show_metadata=False), which would result in a Pandas dataframe with a single column containing the metadata as dict.
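A sketch of the proposed flag, written as a standalone helper for illustration (the real change would live on TimeSeriesResponse):

import pandas as pd

def to_pandas(response, show_metadata=True):
    items = response.to_json()  # list of time series dicts
    base = pd.DataFrame([{k: v for k, v in ts.items() if k != "metadata"} for ts in items])
    if show_metadata:
        # one column per metadata key; missing keys become NaN
        meta = pd.json_normalize([ts.get("metadata", {}) for ts in items])
        return pd.concat([base, meta], axis=1)
    # single 'metadata' column holding the raw dicts
    base["metadata"] = [ts.get("metadata", {}) for ts in items]
    return base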

`type` argument doesn't work for `events.get_events` method

Describe the bug
When trying to get events with the type argument specified, events of other types are returned as well.

To Reproduce

from cognite.client import CogniteClient

cognite_client = CogniteClient()
asset_id = 8040340116462668

cursor_id = None
all_events = list()
while True:
  events_response = cognite_client.events.get_events(type="Workorder", asset_id=asset_id, autopaging=True)
  events = events_response.to_json()
  cursor_id = events_response.next_cursor()
  
  all_events.extend(events)
  
  if cursor_id is None:
    break


assert all([event['type'] == "Workorder" for event in all_events]), "Events contains not only `Workorders` events"

Expected behavior
All events should have type='Workorder'

Screenshots
In the context of the previous code snippet (screenshot omitted).

Additional context
Also, I have the same problem with type='Isolation_Certificate'; it's not working either.

Asset hierarchy visualization

Feature idea: Visualize asset hierarchy

As a data scientist I want to visualize, in a notebook, the asset hierarchy graph in the near vicinity of a certain node

Inspiration

Open questions

  • Does this belong here or in ML hosting? We want customers to be able to use this.
  • Can this extend into a toolset for evaluating asset hierarchies?

Process leaking due to flawed process lifetime management

Due to how process lifetime is managed in cognite-sdk-python, spawned processes are not guaranteed to be terminated by the parent process, and may in some cases lead to "process leaks" (analogous to memory leaks). For an example of where this may happen, see this link. Note that this is not the only case where this pattern appears in the repository.

For example, if the call to Pool.map throws an exception, the part terminating the process(es) will never be called, and you have a "process leak".

This could be solved either by wrapping the code in try...finally or, preferably, by using a context manager.
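A minimal sketch of the context-manager approach (standard library only, not the SDK's actual code):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # __exit__ terminates the pool even if map raises, so no processes leak.
    with Pool(processes=4) as pool:
        results = pool.map(square, range(100))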

Adapt to 0.5 endpoint changes

Breaking changes

Storage/Files

  • /storage (Cloud Storage) endpoints renamed to /files (Files)
  • GET /api/0.5/projects/{project}/storage/{id}/info moved to /api/0.5/projects/{project}/files/{id}. Now returns a list with a single file.
  • GET /api/0.5/projects/{project}/storage/{id} moved to /api/0.5/projects/{project}/files/{id}/downloadlink

Assets

  • GET /api/0.5/projects/{project}/assets/{id} moved to /api/0.5/projects/{project}/assets/{id}/subtree
  • GET /api/0.5/projects/{project}/assets/{id} now returns a list with a single asset.

Events

  • GET /api/0.5/projects/{project}/events/{eventId} now returns a list with a single element

CogniteClient.get throws exception

When I run

from cognite import CogniteClient
CogniteClient().get('/login')

I get the following error:

...
File ".../cognite/client/cognite_client.py", line 164, in get
return self._api_client._get(url, params, headers)
File ".../cognite/client/_api_client.py", line 61, in wrapper
res = method(client_instance, full_url, *args, **kwargs)
TypeError: _get() got multiple values for argument 'headers'

The error is caused by the default params and headers arguments (with the value None) being passed as positional arguments to wrapper, which then adds keyword arguments before calling the wrapped _get method.
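A hypothetical reconstruction of the pattern (not the actual source) that reproduces the same TypeError:

def request_method(method):
    def wrapper(client_instance, url, *args, **kwargs):
        full_url = "https://example.com" + url  # stand-in for URL building
        kwargs["headers"] = kwargs.get("headers") or {}  # injects 'headers' as a kwarg
        return method(client_instance, full_url, *args, **kwargs)
    return wrapper

class ApiClient:
    @request_method
    def _get(self, url, params=None, headers=None):
        ...

# ApiClient()._get("/login", None, None) -> headers arrives both positionally
# (inside *args) and as a keyword argument -> TypeError: _get() got multiple
# values for argument 'headers'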

This problem probably applies to other functions wrapped by request_method, like post and delete.

I worked around the bug by bypassing the wrapper with CogniteClient()._api_client._get('/login').

Use Retry-After header after 429/503

Is your feature request related to a problem? Please describe.
We get some 429 responses from the backend, and we should use their suggested delay before the next request.

Describe the solution you'd like
Use this header value as a parameter to the retry-function.
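A sketch of the idea (plain requests, not the SDK's retry machinery):

import time
import requests

def get_with_retry(url, max_retries=5, backoff=1.0):
    for attempt in range(max_retries):
        res = requests.get(url)
        if res.status_code not in (429, 503):
            return res
        # Prefer the server's suggested delay; fall back to exponential backoff.
        # (Retry-After may also be an HTTP date; numeric seconds are assumed here.)
        retry_after = res.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else backoff * 2 ** attempt
        time.sleep(delay)
    return res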

Fix autocompletion issues in CogniteClient

Currently we expose sub-clients using the @property decorator. This breaks auto-completion in Jupyter. If we expose the attributes from the constructor directly, we break auto-completion in other IDEs (such as PyCharm).

This can be fixed by not using the client factory AND exposing the clients directly as attributes from the constructor.
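A sketch of the proposed shape (AssetsAPI and EventsAPI are hypothetical stand-ins for the sub-clients):

class AssetsAPI:
    def __init__(self, client):
        self._client = client

class EventsAPI:
    def __init__(self, client):
        self._client = client

class CogniteClient:
    def __init__(self):
        # Plain attributes assigned in __init__ are picked up both by Jupyter's
        # runtime completion and by static analysis in IDEs like PyCharm.
        self.assets = AssetsAPI(self)
        self.events = EventsAPI(self)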

Error in example code

In the Cognite SDK documentation, in section Datapoints -> post_datapoints, this Python code generates an error:

from cognite.client.stable.datapoints import Datapoint
client = CogniteClient()
start = 1514761200000
my_dummy_data = [Datapoint(timestamp=ms, value=i) for i, ms in range(start, start+100)]
client.datapoints.post_datapoints(my_dummy_data)

TypeError: cannot unpack non-iterable int object

Also, post_datapoints should take two parameters instead of one in the example; a corrected version is sketched below.
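A corrected sketch, using enumerate to fix the unpacking error and passing the time series name as the (assumed) first parameter; the name and the CogniteClient import are illustrative:

from cognite.client import CogniteClient
from cognite.client.stable.datapoints import Datapoint

client = CogniteClient()
start = 1514761200000
my_dummy_data = [Datapoint(timestamp=ms, value=i)
                 for i, ms in enumerate(range(start, start + 100))]
client.datapoints.post_datapoints('my_timeseries_name', my_dummy_data)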

client.time_series.get_time_series does not return metadata

Describe the bug
When executing client.time_series.get_time_series() with include_metadata=True, no metadata is returned.

To Reproduce
Runnable code reproducing the error.

import os
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
import requests

import cognite
from cognite.client.stable.time_series import TimeSeries

sm_api = os.environ['SM_API_KEY']
client = cognite.CogniteClient(api_key=sm_api)
ts_name = 'Test_tssssss'
my_time_series = [TimeSeries(name=ts_name,
                             description='test_description',
                             metadata={'ASSETSCOPENAME': 'meta_test_1'})]
client.time_series.post_time_series(my_time_series)

# create dummy data
np.random.seed(1338)
start_time = int((datetime.now() - timedelta(1)).strftime("%s"))
timestamps = [(start_time + i * 10) * 1000 for i in np.arange(11)]
df = pd.DataFrame({'timestamp': timestamps})
df[ts_name] = np.random.random(df.shape[0])
client.datapoints.post_datapoints_frame(df)

# get time series
ts1 = client.time_series.get_time_series(name=ts_name,
                                         include_metadata=True).to_pandas()
ts1_id = ts1['id'].loc[0]
print(ts1.loc[0])  # no metadata is present

# raw requests for comparison:
# first with no metadata
r1 = requests.get(url='https://api.cognitedata.com/api/0.5/projects/smart-maintenance-sandbox/timeseries/' + str(ts1_id),
                  headers={'Api-Key': sm_api}, params={"includeMetadata": False})
print(r1.text.split('\n'))
# then with metadata
r1 = requests.get(url='https://api.cognitedata.com/api/0.5/projects/smart-maintenance-sandbox/timeseries/' + str(ts1_id),
                  headers={'Api-Key': sm_api}, params={"includeMetadata": True})
print(r1.text.split('\n'))

Expected behavior
The client.time_series.get_time_series(name=ts_name, include_metadata=True) call should return the metadata.

LatestDatapointResponse.to_json() throws IndexError for time series without data points

When I run

client.datapoints.get_latest('ts_without_datapoints').to_json()

I get the exception

.../cognite/client/stable/datapoints.py in to_json(self)
    114     def to_json(self):
    115         """Returns data as a json object"""
--> 116         return self.internal_representation["data"]["items"][0]
    117 
    118     def to_pandas(self):

IndexError: list index out of range

Where

self.internal_representation = {'data': {'items': []}}

This is confusing, and at first glance seemed like a bug in the SDK. A more user friendly approach would be to either:

  • throw a custom exception e.g. NoDatapointsException when calling get_latest
  • return None from to_json()

I would prefer the former, or optionally have a flag to control the behavior.
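A sketch of the preferred fix (NoDatapointsException is a hypothetical type):

class NoDatapointsException(Exception):
    """Raised when get_latest finds no datapoints for a time series."""

class LatestDatapointResponse:
    def __init__(self, internal_representation):
        self.internal_representation = internal_representation

    def to_json(self):
        items = self.internal_representation["data"]["items"]
        if not items:
            raise NoDatapointsException("The time series has no datapoints.")
        return items[0]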


Parameters and attributes mix snake_case and camelCase for same class

In all classes which represent a resource passed to the API, the parameters use snake_case while the attributes on the object use camelCase. This is confusing for the user and results in errors that are difficult to debug.

The reason this is done is so that the client can simply access the __dict__ attribute on the object to convert it to a suitable JSON format without having to convert the keys to camelCase first.

This behaviour should change so that we consistently use snake_case in the SDK, and convert to camelCase only when actually sending the object to the API.
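A sketch of that conversion, applied only at serialization time:

def to_camel_case(snake: str) -> str:
    first, *rest = snake.split("_")
    return first + "".join(word.capitalize() for word in rest)

def dump(resource) -> dict:
    # Attributes stay snake_case internally; camelCase appears only in the payload.
    return {to_camel_case(k): v for k, v in vars(resource).items() if v is not None}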

Inconsistency in cognite.v05 namespace

cognite.v05.assets is not included in the namespace, but cognite.v05.timeseries is. We should choose one: either both are included or neither is.

Create Base DTOs

Currently we have separate DTOs for writing and reading resources,
e.g. Asset, AssetResponse, and AssetListResponse.

This should be consolidated into a single DTO. We should probably still keep the List DTO so that we can have helper methods on this object like to_pandas().

All DTOs should have the following properties:

  • .to_pandas()
  • .__str__()
  • .__repr__()
  • .__eq__()
  • ._load()
  • ._dump(camel_case: bool)
  • All properties of the respective resource

ListDTOs should have the following properties:

  • .to_pandas()
  • .__eq__()
  • .__str__()
  • .__repr__()
  • ._load()
  • all list properties (.__getitem__, .__len__, .__iter__, .__next__)
  • All properties of the respective resource

Note:

  • to_pandas() should lazy-load the pandas dependency to support the sdk-core package (sketched below)
  • __str__ and __repr__ should both return pretty-printable representations of the resource
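A compact sketch of such a base DTO (the helper names are hypothetical):

import re

def to_camel(snake):
    first, *rest = snake.split("_")
    return first + "".join(w.capitalize() for w in rest)

def to_snake(camel):
    return re.sub(r"(?<!^)(?=[A-Z])", "_", camel).lower()

class CogniteResource:
    @classmethod
    def _load(cls, api_repr):
        instance = cls()
        for key, value in api_repr.items():
            setattr(instance, to_snake(key), value)
        return instance

    def _dump(self, camel_case=False):
        if camel_case:
            return {to_camel(k): v for k, v in vars(self).items()}
        return dict(vars(self))

    def to_pandas(self):
        import pandas as pd  # lazy import so a core install works without pandas
        return pd.Series(self._dump()).to_frame()

    def __repr__(self):
        return f"{type(self).__name__}({self._dump()!r})"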

Get a specific time series using name or id

Is your feature request related to a problem? Please describe.
I often have a time series id or name and would like to look up the metadata of the time series.
I'd like to be able to get a timeseries by id and by name.

Describe the solution you'd like
Two additional access patterns:

  • time_series.get_time_series(name='TIMESERIESNAME')
  • time_series.get_time_series(id=1234567890123)

These should raise an error if the time series is not found.

Describe alternatives you've considered
The prefix parameter of time_series.get_time_series works, but it performs a search that may return multiple results. Taking the first result in the response will lead to logical errors if done without additional checks.
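A workaround sketch until such accessors exist, wrapping the prefix search with a uniqueness check:

def get_single_time_series(client, name):
    # A prefix search may return several hits; require exactly one exact match.
    candidates = client.time_series.get_time_series(prefix=name).to_json()
    matches = [ts for ts in candidates if ts["name"] == name]
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one time series named {name!r}, found {len(matches)}")
    return matches[0]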

Additional context
Open question: are time series names unique in CDP, or is only the id unique?
