
cognite-sdk-python's Introduction


Cognite Python SDK


This is the Cognite Python SDK for developers and data scientists working with Cognite Data Fusion (CDF). The package is tightly integrated with pandas and helps you work easily and efficiently with data in CDF.

Reference documentation

Installation

Without any optional dependencies

To install the core version of this package:

$ pip install cognite-sdk

With optional dependencies

A number of optional dependencies may be specified in order to support a wider set of features. The available extras (along with the libraries they include) are:

  • numpy [numpy]
  • pandas [pandas]
  • geo [geopandas, shapely]
  • sympy [sympy]
  • functions [pip]
  • yaml [PyYAML]
  • all [numpy, pandas, geopandas, shapely, sympy, pip, PyYAML]

To include optional dependencies, specify them like this with pip:

$ pip install "cognite-sdk[pandas, geo]"

or like this if you are using poetry:

$ poetry add cognite-sdk -E pandas -E geo

Performance notes

If you regularly need to fetch large amounts of datapoints, consider installing with numpy (or with pandas, as it depends on numpy) for best performance, then use the retrieve_arrays (or retrieve_dataframe) endpoint(s). This avoids building large pure Python data structures, and instead reads data directly into memory-efficient numpy.ndarrays.
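For example, with a recent SDK version the call looks roughly like this; the retrieve_arrays namespace has moved between major versions (older versions expose it as client.datapoints.retrieve_arrays), and the external ID below is hypothetical:

from cognite.client import CogniteClient

client = CogniteClient()  # assumes credentials are configured in the environment

# Reads datapoints directly into numpy-backed DatapointsArray objects instead
# of building large pure-Python lists.
dps = client.time_series.data.retrieve_arrays(
    external_id="my-ts-external-id",  # hypothetical time series
    start="30d-ago",
    end="now",
)
print(dps.value[:10])  # the underlying numpy array of values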

Windows specific

On Windows, it is recommended to install geopandas and its dependencies using the conda package manager; see the geopandas installation page. The following commands create a new environment and install geopandas and the cognite-sdk:

conda create -n geo_env
conda activate geo_env
conda install --channel conda-forge geopandas
pip install cognite-sdk

Changelog

Wondering about upcoming or previous changes to the SDK? Take a look at the CHANGELOG.

Migration Guide

To help you upgrade your code(base) quickly and safely to a newer major version of the SDK, check out the MIGRATION GUIDE. It is a more focused guide based on the detailed CHANGELOG.

Contributing

Want to contribute? Check out CONTRIBUTING.

cognite-sdk-python's People

Contributors

andreavs, buggambit, ddonukis, dependabot[bot], doctrino, erlendvollset, greenbech, haakonvt, hmeiding, johanlrdahl, kennethskaar, mathialo, me-ydv-5, nimeshawij, olacognite, psalaberria002, qtiptip, quecognite, renovate[bot], sanderland, sighol, silvavelosa, sondreso, stianlagstad, tapped, tuanng-cognite, verstraetebert, vincent-cognite, vvemel, wjoel


cognite-sdk-python's Issues

numpy.int64 not accepted by sdk

It seems CDP doesn't know how to handle int64: when pushing data of type numpy.int64, I get an error. Changing it to float or int fixes the issue. Several others have encountered the same problem, since numpy.int64 is a very common type. It would be good to have this type supported by the python-sdk.


get_events ignores keyword arguments such as max_start_time when asset_id is given

Describe the bug
As a data scientist, I would like get_events to take in an asset_id together with additional arguments such as min_start_time, max_start_time, or type at the same time, because I need events to be filtered down to specific assets, time periods, and types.

I suspect the issue is caused by line 136 in events.py: if an asset_id is given, the keyword arguments are not taken into account.
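A hypothetical reconstruction of the suspected pattern (not the actual source), together with the obvious fix:

def _fetch_events(params):
    """Stand-in for the SDK's internal HTTP call."""
    ...

def get_events(asset_id=None, **kwargs):
    if asset_id is not None:
        # BUG: min_start_time, max_start_time, type, ... are silently dropped
        return _fetch_events({"assetId": asset_id})
    return _fetch_events(kwargs)

def get_events_fixed(asset_id=None, **kwargs):
    # Merge asset_id into the same parameter dict instead of branching.
    params = dict(kwargs)
    if asset_id is not None:
        params["assetId"] = asset_id
    return _fetch_events(params)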

To Reproduce

# set up
import os
from datetime import datetime

import matplotlib.pyplot as plt

import cognite

# connect to API
client = cognite.CogniteClient(api_key=os.environ['AKER_API'])

# take one random valve from SKARV
tag = '13-ESV-6001'
min_start = int(datetime(2019, 2, 28).timestamp() * 1000)
valves = client.assets.search_for_assets(name=tag).to_pandas()
valve_1 = valves.loc[valves['name'] == tag]
valve_1_id = valve_1['id'].loc[0]

# try to filter the valve's events down to those no older than February 28 2019
ce1 = client.events.get_events(
    asset_id=valve_1_id,
    type='Workorder',
    min_start_time=min_start,
    limit=1000).to_pandas()

# leave asset_id out
ce2 = client.events.get_events(
    type='Workorder',
    min_start_time=min_start,
    limit=1000).to_pandas()

# plot results
# clearly, ce1 was not filtered by date, while ce2 was
fig, ax = plt.subplots(1, 2)
ax[0].hist(ce1['startTime'].tolist())
ax[0].axvline(min_start, color='red')
ax[1].hist(ce2['startTime'].tolist())
ax[1].axvline(min_start, color='red')
plt.draw()

# also: ce1 contains many types, while ce2 only contains 'Workorder' as specified
print(ce1['type'].unique())
print(ce2['type'].unique())

Expected behavior
The ce1 DataFrame should not contain any records earlier than min_start and no other types than 'Workorder'.

Consider layering the SDK to decrease size

Pandas and numpy are heavy dependencies. We should consider splitting the SDK into a very thin layer that communicates with the API, and a thicker layer which includes utilities such as pandas.

DataTransferService creates new client by default

The DataTransferService creates its own client, and users are easily confused into connecting to two different projects simultaneously.

Suppose I have the following variables set:

export COGNITE_API_KEY=xyz789 # for default_project
export MY_PROJECT_API_KEY=abc123 # For my_project

Then I naively read the documentation and do this:

import os

from cognite import CogniteClient
from cognite.data_transfer_service import DataTransferService

client = CogniteClient(api_key=os.environ.get('MY_PROJECT_API_KEY'), project='my_project')
dts = DataTransferService()  # Will actually access 'default_project' with COGNITE_API_KEY!

I'm actually accessing two different projects with two different api keys, though this is not immediately apparent. To work around this, I have to always remember to explicitly pass api_key when creating a DataTransferService instance. That's a burden we should not put on our users.

Suggested fixes:

  1. If using COGNITE_API_KEY, issue a warning like: "Warning: Using the COGNITE_API_KEY environment variable to connect to the CDP project 'default_project'"
  2. Change the signature of DataTransferService to take client as an optional parameter instead of api_key, cookies, or num_of_workers. If no client is given, a new one is created (see the sketch below).
  3. Implement a getter client.data_transfer_service that returns DataTransferService(client=self).
  4. Extend the documentation for DataTransferService to explicitly mention that a new client is created if none is given, and suggest using cognite.data_transfer_service instead.
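A minimal sketch of fixes 1 and 2 combined (hypothetical signatures, not the current API):

import os
import warnings

from cognite import CogniteClient

class DataTransferService:
    def __init__(self, client=None):
        if client is None:
            # Make the implicit fallback visible instead of silent (fix 1).
            warnings.warn("No client given; using COGNITE_API_KEY to connect to the default project.")
            client = CogniteClient(api_key=os.environ.get("COGNITE_API_KEY"))
        self._client = client

A client.data_transfer_service getter (fix 3) would then simply return DataTransferService(client=self).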

DatapointsResponse.to_pandas() loses time series name information

In order to get the name, one currently has to first use DatapointsResponse.to_json()["name"]. This is inconvenient and also somewhat wasteful, particularly when one wants to get raw datapoints from several time series and subsequently merge them into a common data frame. The suggested fix is to either rename the pandas column to the name of the time series (works for API 0.5, but not good for API 0.6) or increase the amount of metadata available on the response object, e.g. add the time series name and id to the object by default.
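As a stopgap, the name can be recovered manually before merging. A sketch, assuming the datapoints column is called "value":

import pandas as pd

def to_named_frame(response):
    # response: a DatapointsResponse; label its column with the time series name
    name = response.to_json()["name"]
    return response.to_pandas().rename(columns={"value": name})

# frames = [to_named_frame(r) for r in responses]
# merged = pd.concat(frames, axis=1)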

Should handle pagination

When using get_assets, for example, the function should perhaps not only return the first 1000 elements, but automatically page through and return all assets.
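A generic sketch of the desired behaviour (a hypothetical helper, not the SDK's API):

def get_all(fetch_page):
    """fetch_page(cursor) -> (items, next_cursor); next_cursor is None on the last page."""
    items, cursor = [], None
    while True:
        page, cursor = fetch_page(cursor)
        items.extend(page)
        if cursor is None:
            return items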

post_datapoints_frame is undocumented

post_datapoints_frame(dataframe) does not take a time series name as an argument, and there is no example of how to use the function. @trygvekk mentioned that the time series name is taken from the data frame header, but he did not have an example.
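A usage sketch based on the behaviour described above, assuming an authenticated client and that the DataFrame carries a 'timestamp' column in milliseconds since epoch (the convention used elsewhere in these issues), plus one column per time series:

import time
import pandas as pd

now_ms = int(time.time() * 1000)
df = pd.DataFrame({
    "timestamp": [now_ms + i * 1000 for i in range(10)],
    "my_timeseries_name": range(10),  # the column header doubles as the time series name
})
client.datapoints.post_datapoints_frame(df)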

Create standard methods for performing CRUD operations

These standard methods should be resource-type agnostic. As the API specs are consolidated for v1 of the API, these endpoints will be consistent enough to create generic methods for get, create, list, update, delete, search.
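A sketch of what such a layer could look like (hypothetical names, not the final design):

class APIClient:
    def __init__(self, http, base_url):
        self._http = http
        self._base_url = base_url

    def _retrieve(self, resource_path, id):
        return self._http.get(f"{self._base_url}{resource_path}/{id}")

    def _list(self, resource_path, **filters):
        return self._http.get(f"{self._base_url}{resource_path}", params=filters)

    def _create(self, resource_path, items):
        return self._http.post(f"{self._base_url}{resource_path}", json={"items": items})

    def _delete(self, resource_path, ids):
        return self._http.post(f"{self._base_url}{resource_path}/delete", json={"items": ids})

Concrete clients (assets, events, ...) would then only supply their resource_path plus any resource-specific extras such as search.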

get_multi_time_series_datapoints returns an iterator rather than a list

The documentation says:

Returns:
    list(stable.datapoints.DatapointsResponse): A list of data objects containing the requested data
    with several getter methods with different output formats.

But the actual return value is:

return DatapointsResponseIterator([DatapointsResponse(result) for result in results])

Add flag to include/exclude time series metadata in TimeSeriesResponse.to_pandas()

When getting multiple time series by ID, and they have very different metadata, it would be nice to choose whether or not all metadata fields should be separate columns in the resulting dataframe (when calling to_pandas()). I.e. something like to_pandas(show_metadata=False), which would result in a Pandas dataframe with a single column containing the metadata as dict.
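A sketch of the proposed flag, written as a standalone helper for illustration (the real change would live on TimeSeriesResponse):

import pandas as pd

def to_pandas(response, show_metadata=True):
    items = response.to_json()  # list of time series dicts
    base = pd.DataFrame([{k: v for k, v in ts.items() if k != "metadata"} for ts in items])
    if show_metadata:
        # one column per metadata key; missing keys become NaN
        meta = pd.json_normalize([ts.get("metadata", {}) for ts in items])
        return pd.concat([base, meta], axis=1)
    # single 'metadata' column holding the raw dicts
    base["metadata"] = [ts.get("metadata", {}) for ts in items]
    return base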

`type` argument doesn't work for `events.get_events` method

Describe the bug
When trying to get events with the type argument specified, events of other types are returned as well.

To Reproduce

from cognite.client import CogniteClient

cognite_client = CogniteClient()
asset_id = 8040340116462668

cursor_id = None
all_events = list()
while True:
  events_response = cognite_client.events.get_events(type="Workorder", asset_id=asset_id, autopaging=True)
  events = events_response.to_json()
  cursor_id = events_response.next_cursor()
  
  all_events.extend(events)
  
  if cursor_id is None:
    break


assert all([event['type'] == "Workorder" for event in all_events]), "Events contains not only `Workorders` events"

Expected behavior
All events should have type='Workorder'

Screenshots
In the context of the previous code snippet (screenshot omitted).

Additional context
Also, I have the same problem with type='Isolation_Certificate'; it's not working either.

Asset hierarchy visualization

Feature idea: Visualize asset hierarchy

As a data scientist I want to visualize, in a notebook, the asset hierarchy graph in the near vicinity of a certain node

Inspiration

Open questions

  • Does this belong here or in ML hosting? We want customers to be able to use this.
  • Can this extend into a toolset for evaluating asset hierarchies?

Process leaking due to flawed process lifetime management

Due to how process lifetime is managed in cognite-sdk-python, spawned processes are not guaranteed to be terminated by the parent process, and may in some cases lead to "process leaks" (analogous to memory leaks). For an example of where this may happen, see this link. Note that this is not the only case where this pattern appears in the repository.

For example, if the call to Pool.map throws an exception, the part terminating the process(es) will never be called, and you have a "process leak".

This could be solved either by wrapping the code in try...finally or, preferably, by using a context manager.
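A minimal sketch of the context-manager approach (standard library only, not the SDK's actual code):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # __exit__ terminates the pool even if map raises, so no processes leak.
    with Pool(processes=4) as pool:
        results = pool.map(square, range(100))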

Adapt to 0.5 endpoint changes

Breaking changes

Storage/Files

  • /storage (Cloud Storage) endpoints renamed to /files (Files)
  • GET /api/0.5/projects/{project}/storage/{id}/info moved to /api/0.5/projects/{project}/files/{id}. Now returns a list with a single file.
  • GET /api/0.5/projects/{project}/storage/{id} moved to /api/0.5/projects/{project}/files/{id}/downloadlink

Assets

  • GET /api/0.5/projects/{project}/assets/{id} moved to /api/0.5/projects/{project}/assets/{id}/subtree
  • GET /api/0.5/projects/{project}/assets/{id} now returns a list with a single asset.

Events

  • GET /api/0.5/projects/{project}/events/{eventId} now returns a list with a single element

CogniteClient.get throws exception

When I run

from cognite import CogniteClient
CogniteClient().get('/login')

I get the following error:

...
File ".../cognite/client/cognite_client.py", line 164, in get
return self._api_client._get(url, params, headers)
File ".../cognite/client/_api_client.py", line 61, in wrapper
res = method(client_instance, full_url, *args, **kwargs)
TypeError: _get() got multiple values for argument 'headers'

The error is caused by the default params and headers arguments (with the value None) being passed as positional arguments to wrapper, which then adds keyword arguments before calling the wrapped _get method.
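A hypothetical reconstruction of the pattern (not the actual source) that reproduces the same TypeError:

def request_method(method):
    def wrapper(client_instance, url, *args, **kwargs):
        full_url = "https://example.com" + url  # stand-in for URL building
        kwargs["headers"] = kwargs.get("headers") or {}  # injects 'headers' as a kwarg
        return method(client_instance, full_url, *args, **kwargs)
    return wrapper

class ApiClient:
    @request_method
    def _get(self, url, params=None, headers=None):
        ...

# ApiClient()._get("/login", None, None) -> headers arrives both positionally
# (inside *args) and as a keyword argument -> TypeError: _get() got multiple
# values for argument 'headers'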

This problem probably applies to other functions wrapped by request_method, like post and delete.

I worked around the bug by bypassing the wrapper with CogniteClient()._api_client._get('/login').

Use Retry-After header after 429/503

Is your feature request related to a problem? Please describe.
We get some 429 responses from the backend, and we should use their suggested delay before the next request.

Describe the solution you'd like
Use this header value as a parameter to the retry-function.
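A sketch of the idea (plain requests, not the SDK's retry machinery):

import time
import requests

def get_with_retry(url, max_retries=5, backoff=1.0):
    for attempt in range(max_retries):
        res = requests.get(url)
        if res.status_code not in (429, 503):
            return res
        # Prefer the server's suggested delay; fall back to exponential backoff.
        # (Retry-After may also be an HTTP date; numeric seconds are assumed here.)
        retry_after = res.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else backoff * 2 ** attempt
        time.sleep(delay)
    return res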

Fix autocompletion issues in CogniteClient

Currently we expose sub-clients using the @property decorator. This breaks auto-completion in Jupyter. If we expose the attributes from the constructor directly, we break auto-completion in other IDEs (such as PyCharm).

This can be fixed by not using the client factory AND exposing the clients directly as attributes from the constructor.
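A sketch of the proposed shape (AssetsAPI and EventsAPI are hypothetical stand-ins for the sub-clients):

class AssetsAPI:
    def __init__(self, client):
        self._client = client

class EventsAPI:
    def __init__(self, client):
        self._client = client

class CogniteClient:
    def __init__(self):
        # Plain attributes assigned in __init__ are picked up both by Jupyter's
        # runtime completion and by static analysis in IDEs like PyCharm.
        self.assets = AssetsAPI(self)
        self.events = EventsAPI(self)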

Error in example code

In the Cognite SDK documentation, in section Datapoints -> post_datapoints, this Python code generates an error:

from cognite.client.stable.datapoints import Datapoint
client = CogniteClient()
start = 1514761200000
my_dummy_data = [Datapoint(timestamp=ms, value=i) for i, ms in range(start, start+100)]
client.datapoints.post_datapoints(my_dummy_data)

TypeError: cannot unpack non-iterable int object

Also, post_datapoints should take two parameters instead of one in the example; a corrected version is sketched below.
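A corrected sketch, using enumerate to fix the unpacking error and passing the time series name as the (assumed) first parameter; the name and the CogniteClient import are illustrative:

from cognite.client import CogniteClient
from cognite.client.stable.datapoints import Datapoint

client = CogniteClient()
start = 1514761200000
my_dummy_data = [Datapoint(timestamp=ms, value=i)
                 for i, ms in enumerate(range(start, start + 100))]
client.datapoints.post_datapoints('my_timeseries_name', my_dummy_data)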

client.time_series.get_time_series does not return metadata

Describe the bug
When executing client.time_series.get_time_series() with include_metadata=True, no metadata is returned.

To Reproduce
Runnable code reproducing the error.

import os
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
import requests

import cognite
from cognite.client.stable.time_series import TimeSeries

sm_api = os.environ['SM_API_KEY']
client = cognite.CogniteClient(api_key=sm_api)
ts_name = 'Test_tssssss'
my_time_series = [TimeSeries(name=ts_name,
                             description='test_description',
                             metadata={'ASSETSCOPENAME': 'meta_test_1'})]
client.time_series.post_time_series(my_time_series)

# create dummy data
np.random.seed(1338)
start_time = int((datetime.now() - timedelta(1)).strftime("%s"))
timestamps = [(start_time + i * 10) * 1000 for i in np.arange(11)]
df = pd.DataFrame({'timestamp': timestamps})
df[ts_name] = np.random.random(df.shape[0])
client.datapoints.post_datapoints_frame(df)

# get time series
ts1 = client.time_series.get_time_series(name=ts_name,
                                         include_metadata=True).to_pandas()
ts1_id = ts1['id'].loc[0]
print(ts1.loc[0])  # no metadata is present

# raw requests for comparison:
# first with no metadata
r1 = requests.get(url='https://api.cognitedata.com/api/0.5/projects/smart-maintenance-sandbox/timeseries/' + str(ts1_id),
                  headers={'Api-Key': sm_api}, params={"includeMetadata": False})
print(r1.text.split('\n'))
# then with metadata
r1 = requests.get(url='https://api.cognitedata.com/api/0.5/projects/smart-maintenance-sandbox/timeseries/' + str(ts1_id),
                  headers={'Api-Key': sm_api}, params={"includeMetadata": True})
print(r1.text.split('\n'))

Expected behavior
The client.time_series.get_time_series(name=ts_name, include_metadata=True) call should return the metadata.

LatestDatapointResponse.to_json() throws IndexError for time series without data points

When I run

client.datapoints.get_latest('ts_without_datapoints').to_json()

I get the exception

.../cognite/client/stable/datapoints.py in to_json(self)
    114     def to_json(self):
    115         """Returns data as a json object"""
--> 116         return self.internal_representation["data"]["items"][0]
    117 
    118     def to_pandas(self):

IndexError: list index out of range

Where

self.internal_representation = {'data': {'items': []}}

This is confusing, and at first glance seemed like a bug in the SDK. A more user friendly approach would be to either:

  • throw a custom exception e.g. NoDatapointsException when calling get_latest
  • return None from to_json()

I would prefer the former, or optionally have a flag to control the behavior.
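A sketch of the preferred fix (NoDatapointsException is a hypothetical type):

class NoDatapointsException(Exception):
    """Raised when get_latest finds no datapoints for a time series."""

class LatestDatapointResponse:
    def __init__(self, internal_representation):
        self.internal_representation = internal_representation

    def to_json(self):
        items = self.internal_representation["data"]["items"]
        if not items:
            raise NoDatapointsException("The time series has no datapoints.")
        return items[0]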


Parameters and attributes mix snake_case and camelCase for same class

In all classes which represent a resource passed to the API, the parameters use snake_case while the attributes on the object use camelCase. This is confusing for the user and results in errors that are difficult to debug.

The reason this is done is so that the client can simply access the __dict__ attribute on the object to convert it to a suitable JSON format without having to convert the keys to camelCase first.

This behaviour should change so that we consistently use snake_case in the SDK, and convert to camelCase only when actually sending the object to the API.
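A sketch of that conversion, applied only at serialization time:

def to_camel_case(snake: str) -> str:
    first, *rest = snake.split("_")
    return first + "".join(word.capitalize() for word in rest)

def dump(resource) -> dict:
    # Attributes stay snake_case internally; camelCase appears only in the payload.
    return {to_camel_case(k): v for k, v in vars(resource).items() if v is not None}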

Inconsistency in cognite.v05 namespace

cognite.v05.assets is not included in the namespace, but cognite.v05.timeseries is. We should choose one: either both are included or neither is.

Create Base DTOs

Currently we have separate DTOs for writing and reading resources,
e.g. Asset, AssetResponse, and AssetListResponse.

This should be consolidated into a single DTO. We should probably still keep the List DTO so that we can have helper methods on this object like to_pandas().

All DTOs should have the following properties:

  • .to_pandas()
  • .__str__()
  • .__repr__()
  • .__eq__()
  • ._load()
  • ._dump(camel_case: bool)
  • All properties of the respective resource

ListDTOs should have the following properties:

  • .to_pandas()
  • .__eq__()
  • .__str__()
  • .__repr__()
  • ._load()
  • all list properties (.__getitem__, .__len__, .__iter__, .__next__)
  • All properties of the respective resource

Note:

  • to_pandas() should lazy-load the pandas dependency to support the sdk-core package (sketched below)
  • __str__ and __repr__ should both return pretty-printable representations of the resource
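A compact sketch of such a base DTO (the helper names are hypothetical):

import re

def to_camel(snake):
    first, *rest = snake.split("_")
    return first + "".join(w.capitalize() for w in rest)

def to_snake(camel):
    return re.sub(r"(?<!^)(?=[A-Z])", "_", camel).lower()

class CogniteResource:
    @classmethod
    def _load(cls, api_repr):
        instance = cls()
        for key, value in api_repr.items():
            setattr(instance, to_snake(key), value)
        return instance

    def _dump(self, camel_case=False):
        if camel_case:
            return {to_camel(k): v for k, v in vars(self).items()}
        return dict(vars(self))

    def to_pandas(self):
        import pandas as pd  # lazy import so a core install works without pandas
        return pd.Series(self._dump()).to_frame()

    def __repr__(self):
        return f"{type(self).__name__}({self._dump()!r})"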

Get a specific time series using name or id

Is your feature request related to a problem? Please describe.
I often have a time series id or name and would like to look up the metadata of the time series.
I'd like to be able to get a timeseries by id and by name.

Describe the solution you'd like
Two additional access patterns:

  • time_series.get_time_series(name='TIMESERIESNAME')
  • time_series.get_time_series(id=1234567890123)

These should raise an error if the time series is not found.

Describe alternatives you've considered
The prefix parameter of time_series.get_time_series works, but it performs a search that may return multiple results. Taking the first result in the response will lead to logical errors if done without additional checks.
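A workaround sketch until such accessors exist, wrapping the prefix search with a uniqueness check:

def get_single_time_series(client, name):
    # A prefix search may return several hits; require exactly one exact match.
    candidates = client.time_series.get_time_series(prefix=name).to_json()
    matches = [ts for ts in candidates if ts["name"] == name]
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one time series named {name!r}, found {len(matches)}")
    return matches[0]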

Additional context
Open question: are time series names unique in CDP, or is only the id unique?
