
databricks-rest-client's Introduction

Databricks Java Rest Client

This is a simple Java library that provides programmatic access to the Databricks REST service.

NOTE: the project used to be under the groupId com.edmunds.databricks. It is now under com.edmunds.

At some point we plan to delete the old artifacts from Maven Central.


API Overview

Javadocs

This library implements only a subset of the functionality that the Databricks REST interface provides. The idea is to add functionality as users of this library need it. Here are the endpoints currently supported:

  • Cluster Service

  • DBFS Service

  • Groups Service

  • Instance Profiles Service

  • Job Service (v2.1)

  • Library Service

  • Workspace Service


  • SCIM (preview mode, subject to changes)

Please look at the javadocs for the specific service to get more detailed information on what functionality is currently available.

If there is important functionality that is currently missing, please create a GitHub issue.

Examples

public class MyClient {
  public static void main(String[] args) throws DatabricksRestException, IOException {
    // Construct a serviceFactory using token authentication
    DatabricksServiceFactory serviceFactory =
        DatabricksServiceFactory.Builder
            .createServiceFactoryWithTokenAuthentication("myToken", "myHost")
            .withMaxRetries(5)
            .withRetryInterval(10000L)
            .build();

    // Let's get our Databricks job "myJob" and set maxRetries to 5
    JobDTOv21 jobDTO = serviceFactory.getJobServiceV21().getJobByName("myJob");
    JobSettingsDTOv21 jobSettingsDTO = jobDTO.getSettings();
    jobSettingsDTO.setMaxRetries(5);
    serviceFactory.getJobServiceV21().upsertJob(jobSettingsDTO, true);

    // Let's install a jar on a specific cluster
    LibraryDTO libraryDTO = new LibraryDTO();
    libraryDTO.setJar("s3://myBucket/myJar.jar");
    for (ClusterInfoDTO clusterInfoDTO : serviceFactory.getClusterService().list()) {
      if (clusterInfoDTO.getClusterName().equals("myCluster")) {
        serviceFactory.getLibraryService().install(clusterInfoDTO.getClusterId(), new LibraryDTO[]{libraryDTO});
      }
    }
  }
}

For more examples, take a look at the service tests.

Building, Installing and Running

Getting Started and Prerequisites

Building

How to build the project locally: mvn clean install

Unit Tests

There are currently no unit tests for this project. In our view, the only meaningfully testable functionality is the integration between our client and an actual Databricks instance, so we currently have only integration tests.

Integration Tests

IMPORTANT: integration tests do not execute automatically as part of a build. It is your responsibility (and that of pull request reviewers) to make sure the integration tests pass before code is merged.

Setup

You need to set the following environment variables in your .bash_profile:

export DB_URL=my-databricks-account.databricks.com
export DB_TOKEN=my-token

In order for the integration tests to run, you must have a valid token for the user in question. Here is how to set one up: Set up Tokens. Note: to run the SCIM integration tests, your user must have admin rights.

Executing Integration Tests

mvn clean install org.apache.maven.plugins:maven-failsafe-plugin:integration-test

Deployment

Please see CONTRIBUTING.md for our release process. As this is a library, there is no deployment step.

Contributing

Please read CONTRIBUTING.md for the process for merging code into master.

databricks-rest-client's People

Contributors

bolshem, cornelcreanga, dependabot[bot], javamonkey79, jeff303, joongho, jyothikomm, reillydj, samshuster, seregasheypak, ssh-parity, techpavan


databricks-rest-client's Issues

databricks-rest-client:3.0.6 - java.lang.NoSuchMethodError

databricks-rest-client:3.0.6 upgraded log4j to 2.17.2; however, this introduced a new issue:

Exception in thread "main" java.lang.NoSuchMethodError: 'java.lang.ClassLoader org.apache.logging.log4j.util.StackLocatorUtil.getCallerClassLoader(int)'
	at org.apache.log4j.Logger.getLogger(Logger.java:40)
	at com.edmunds.rest.databricks.restclient.DefaultHttpClientBuilderFactory.<clinit>(DefaultHttpClientBuilderFactory.java:44)
	at com.edmunds.rest.databricks.DatabricksServiceFactory$Builder.build(DatabricksServiceFactory.java:352)
	at X.X.clients.Databricks.ServiceFactoryByToken(Databricks.java:26)
	at VincentApp.main(VincentApp.java:26)

Add support for InstanceProfiles API

The Databricks REST API has an Instance Profiles endpoint, allowing creation and deletion of instance profiles in Databricks. This issue would create the InstanceProfilesService class and implementation.

DatabricksRestClientTest only tests Password Authenticated Client

The DataProvider(name = "Clients") provides three references to the same client (the password-authenticated client), since the DatabricksFixtures class only holds one reference to a client.

The goal is to have the DataProvider provide references to each of the three different kinds of clients.

Upgrade log4j from 1.2.17 to latest

Apache Log4j2 versions 2.0-beta7 through 2.17.0 (excluding security fix releases 2.3.2 and 2.12.4) are vulnerable to a remote code execution (RCE) attack where an attacker with permission to modify the logging configuration file can construct a malicious configuration using a JDBC Appender with a data source referencing a JNDI URI which can execute remote code. This issue is fixed by limiting JNDI data source names to the java protocol in Log4j2 versions 2.17.1, 2.12.4, and 2.3.2.

Add more documentation for checkstyle guidelines

Documentation for setting up the Google-standard checkstyle could use some beefing up. Specifically, add some notes (for IntelliJ users) on installing the correct plugin, importing the Google standards, and running the checkstyle code scan.

Update (again) TerminationCodeDTO

A new Azure code was added (GLOBAL_INIT_SCRIPT_FAILURE) - not present yet in the documentation. The Jackson deserialization will fail.

In the future I think a better solution would be to convert unknown enum values to null (com.fasterxml.jackson.databind.DeserializationFeature.READ_UNKNOWN_ENUM_VALUES_AS_NULL) instead of throwing an error. The Azure API seems to change often and it will be hard to keep pace.
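
As a sketch of that suggestion, Jackson's READ_UNKNOWN_ENUM_VALUES_AS_NULL feature can be enabled on the ObjectMapper. This is a minimal, self-contained illustration with a stand-in enum, not the library's actual TerminationCodeDTO:

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class UnknownEnumDemo {
  // Stand-in for the library's TerminationCodeDTO enum (abbreviated).
  public enum TerminationCode { USER_REQUEST, INTERNAL_ERROR }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper()
        .enable(DeserializationFeature.READ_UNKNOWN_ENUM_VALUES_AS_NULL);

    // Known values still deserialize normally.
    TerminationCode known = mapper.readValue("\"USER_REQUEST\"", TerminationCode.class);
    // An unknown value becomes null instead of throwing an exception.
    TerminationCode unknown =
        mapper.readValue("\"GLOBAL_INIT_SCRIPT_FAILURE\"", TerminationCode.class);

    System.out.println(known + " / " + unknown);
  }
}
```

With this feature enabled, new Azure-only codes would silently map to null rather than breaking the cluster API.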

#83

@samshuster

Cleanup / organize DTOs

There are plenty of DTOs there; do we really need all of them? If so, maybe we could add javadoc to these classes and perhaps organize them into separate folders.

Update (again) TerminationCodeDTO according with the last API spec

New codes were added (AZURE_RESOURCE_PROVIDER_THROTTLING, AZURE_RESOURCE_MANAGER_THROTTLING,
NETWORK_CONFIGURATION_FAILURE) and the Jackson deserialization will fail for these cases.

The codes were only added to the MSFT specification https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--terminationcode . The Databricks link (https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterterminationreasonterminationcode) seems to be lagging behind.

create toString methods for databricks DTO classes

Narrative

As an engineer working with the databricks rest client
I'd like meaningful toString methods on all java bean classes
Such that I can easily work with and debug these classes

Implementation Details

Nice to have: use jackson to create toString for DTOs, to match up with annotations, e.g.

//ObjectMapper could come from a constant
return new ObjectMapper().writeValueAsString(this);

Alternatively, use ReflectionToStringBuilder from commons-lang:

return ReflectionToStringBuilder.toString(this);

Update instance types used in ClusterServiceTest

The instance types specified in the ClusterServiceTest integration tests have been deprecated by Databricks. It is probably best to update them to an instance type that is still supported.

I'd suggest changing r3.xlarge to m4.large where applicable.

Update README for recent issues

Groups API and InstanceProfiles API were recently implemented, but the README was not updated to reflect that functionality.

Goal of this issue is to update the README to show that those two APIs are available for use.

pass tokens in api calls

One suggestion is to pass tokens in the API calls. This is the recommended way of authenticating and also opens the API up to Azure Databricks users.

Allow use of custom HttpRequestExecutor in HttpClient

Hello, we are currently using the library to run jobs in our production environment. We need to collect metrics around requests to the Databricks API, and we use a custom class extending HttpRequestExecutor to accomplish this right now. We had to implement our own extension of DatabricksRestClientImpl in order to pass our custom HttpRequestExecutor to the HttpClientBuilder, which led to a lot of code duplicated from the AbstractDatabricksRestClient#initClient method, since the builder is not accessible before initialization. We would like a way to pass this executor to the builder, and we are willing to contribute to the project to make it happen.

My first thought was to add this executor to the DatabricksServiceFactory.Builder class so that it could be set in the initClient method, but it does not seem to fit the pattern of all the values in the builder being primitives or strings. Right now I do not see a pattern for passing custom parameters to the HttpClientBuilder. Is there a preferred way to accomplish this that I could work on?

Here is a snippet of the workaround we have implemented:

    public CustomDatabricksRestClient(DatabricksServiceFactory.Builder builder, ...) {
        super(builder);
        initClientWithExecutor(builder, ...);
    }

    @Override
    protected void initClient(DatabricksServiceFactory.Builder builder) {
        // No-op init
    }

    private void initClientWithExecutor(DatabricksServiceFactory.Builder builder, ...) {
        CustomHttpRequestExecutor customHttpRequestExecutor = new CustomHttpRequestExecutor(...);

        HttpClientBuilder clientBuilder = HttpClients.custom().useSystemProperties()
                .setRetryHandler(retryHandler)
                .setServiceUnavailableRetryStrategy(retryStrategy)
                .setRequestExecutor(customHttpRequestExecutor)
                .setDefaultRequestConfig(createRequestConfig(builder));

        List<Header> headers = new ArrayList<>();
        if (StringUtils.isNotEmpty(builder.getToken())) {
            Header authHeader = new BasicHeader("Authorization", String.format("Bearer %s", builder.getToken()));
            headers.add(authHeader);
        }

        String userAgent = builder.getUserAgent();
        if (userAgent != null && userAgent.length() > 0) {
            Header userAgentHeader = new BasicHeader("User-Agent", userAgent);
            headers.add(userAgentHeader);
        }

        if (!headers.isEmpty()) {
            clientBuilder.setDefaultHeaders(headers);
        }

        try {
            SSLContext ctx = SSLContext.getDefault();
            // Allow TLSv1.2 protocol only
            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(
                    ctx,
                    new String[]{"TLSv1.2"},
                    null,
                    SSLConnectionSocketFactory.getDefaultHostnameVerifier());
            clientBuilder = clientBuilder.setSSLSocketFactory(sslsf);
        } catch (Exception e) {
            _log.error("", e);
        }

        client = clientBuilder.build(); //CloseableHttpClient

        url = String.format("https://%s/api/%s", host, apiVersion);
        mapper = new ObjectMapper().setSerializationInclusion(JsonInclude.Include.NON_DEFAULT);
    }

ClusterService should have an "upsert" method and it should take ClusterAttributesDTO as parameters

The purpose of this story is to make the ClusterService easier to use.
However, we should keep it backwards compatible for the time being.

Requirement 1

ClusterService currently has a create and edit method, but no upsert method (upsert here meaning create it if it doesn't exist or update the configuration if it already exists) which is a very useful piece of logic to have in the library.

Requirement 2

In addition, we should offer methods in ClusterService that take ClusterAttributesDTO, which can be deserialized directly from JSON objects, instead of forcing users to use the CreateClusterRequest objects.

For an example look at the JobService which does not use Request objects anymore.

I think the old CreateClusterRequest methods should be marked as deprecated.
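
To make the requested semantics concrete, here is a self-contained sketch of an upsert over a cluster API. The ClusterApi interface below is a hypothetical stand-in; in the library the real players would be ClusterService, ClusterInfoDTO and ClusterAttributesDTO, and the calls would throw IOException / DatabricksRestException:

```java
import java.util.List;

public class ClusterUpsertSketch {

  // Hypothetical stand-in for the relevant slice of ClusterService.
  public interface ClusterApi {
    List<String[]> list();                      // each entry: {clusterId, clusterName}
    void edit(String clusterId, String config); // push a new configuration
    String create(String config);               // returns the new clusterId
  }

  /** Create the cluster if it does not exist, otherwise update its configuration. */
  public static String upsert(ClusterApi api, String clusterName, String config) {
    for (String[] cluster : api.list()) {
      if (cluster[1].equals(clusterName)) {
        api.edit(cluster[0], config);  // exists: update in place
        return cluster[0];
      }
    }
    return api.create(config);         // absent: create it
  }
}
```

The same list-then-branch pattern could live directly in ClusterService without breaking the existing create/edit methods.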

Please remove log4j.xml from the project

Please remove log4j.xml from the project. There is no reason for a project that is going to be included as a library in another one to define its own log4j configuration file. If there are multiple log4j files, the classloader will just pick one of them.

Jackson dependencies - NoClassDefFoundError

When using databricks-rest-client, I am getting
java.lang.NoClassDefFoundError: com/fasterxml/jackson/annotation/JsonMerge

Inspecting dependencies, I see that

[INFO] +- com.edmunds:databricks-rest-client:jar:2.1.2:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-databind:jar:2.9.7:compile
[INFO] | | +- com.fasterxml.jackson.core:jackson-annotations:jar:2.6.0:compile
[INFO] | | \- com.fasterxml.jackson.core:jackson-core:jar:2.9.7:compile

(I know that the latest version of databricks-rest-client is 2.2.x, but jackson dependencies didn't change)

It pulls in jackson-databind version 2.9.7, which for some reason pulls in an older version of jackson-annotations (2.6.0).

jackson-annotations 2.6.0 does not have JsonMerge
(https://github.com/FasterXML/jackson-annotations/blob/2.6/src/main/java/com/fasterxml/jackson/annotation/JsonMerge.java),
while the latest version does
(https://github.com/FasterXML/jackson-annotations/blob/ab01a57066d441ed4eda8719808de2f39f094973/src/main/java/com/fasterxml/jackson/annotation/JsonMerge.java).

Googling the issue turned up this:
FasterXML/jackson-annotations#119
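
A conventional consumer-side workaround (a generic Maven technique, not a fix shipped by this project) is to pin jackson-annotations to the same version as jackson-databind via dependencyManagement in your own pom.xml:

```xml
<dependencyManagement>
  <dependencies>
    <!-- Force a jackson-annotations version that actually contains JsonMerge -->
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-annotations</artifactId>
      <version>2.9.7</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

The proper fix, of course, is for the library itself to declare aligned Jackson versions.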

Update TerminationCodeDTO according with the last API spec

Some of the codes specified here https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterterminationreasonterminationcode are not reflected in the TerminationCodeDTO => error on deserialization (*) => the cluster API is not usable.

(*)
Example:
Cannot deserialize value of type com.edmunds.rest.databricks.DTO.clusters.TerminationCodeDTO from String "SPARK_ERROR": not one of the values accepted for Enum class: [CLOUD_PROVIDER_LAUNCH_FAILURE, INIT_SCRIPT_FAILURE, INTERNAL_ERROR, INSTANCE_UNREACHABLE, INSTANCE_POOL_CLUSTER_FAILURE, COMMUNICATION_LOST, REQUEST_REJECTED, INACTIVITY, CLOUD_PROVIDER_SHUTDOWN, USER_REQUEST, TRIAL_EXPIRED, INVALID_ARGUMENT, SPARK_STARTUP_FAILURE, UNEXPECTED_LAUNCH_FAILURE, JOB_FINISHED]

#77

@samshuster

Accessing databricks from behind a proxy

Many enterprises use web proxies.

But unfortunately it seems to me that it is not possible to use the databricks-rest-client library in such an environment because of the use of Apache HttpClient. This class has a long-known issue whereby the standard Java system properties relating to proxy setup (http.proxyHost, http.proxyPort, https.proxyHost, https.proxyPort and http.nonProxyHosts) are ignored by default.

The suggested solution is to call useSystemProperties() on the HttpClientBuilder used to create the client.

This (I think!) would require a small change to the first line of DatabricksRestClientImpl.initClient(...)

//    HttpClientBuilder clientBuilder = HttpClients.custom()
    HttpClientBuilder clientBuilder = HttpClientBuilder.create().useSystemProperties() // pick up Java proxy settings

However, I'm a little uncomfortable submitting a pull request for this change, as I'm not sure whether useSystemProperties() will stomp on some other aspect of the builder configuration that is important for communicating with a Databricks endpoint. Or maybe there is a workaround for this issue that I haven't spotted.

(Btw. I'm using this library through the databricks-maven-plugin - and our build/CI system is secured behind a proxy which I think is a fairly common set-up.)
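
For completeness, the JVM proxy properties in question can also be set programmatically before the client is built. This is a plain-JDK illustration (the host, port and non-proxy-host values are made up):

```java
public class ProxySettingsDemo {
  public static void main(String[] args) {
    // Standard JVM proxy properties. An Apache HttpClient built with
    // useSystemProperties() will pick these up; by default it ignores them.
    System.setProperty("https.proxyHost", "proxy.example.com");
    System.setProperty("https.proxyPort", "8080");
    System.setProperty("http.nonProxyHosts", "localhost|*.internal.example.com");

    System.out.println(System.getProperty("https.proxyHost") + ":"
        + System.getProperty("https.proxyPort"));
  }
}
```

In practice the properties are usually passed as -D flags on the JVM command line rather than set in code.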

Add support for new object types introduced in Workspace List API

Recently the workspace list API seems to be returning new object types (FILE & REPO) (https://docs.databricks.com/dev-tools/api/latest/workspace.html#objecttype) (the type FILE is not listed there yet, but we're seeing those in our API responses).
This library supports only 3 object types as far as I can see, hence I wanted to add support for the new types.
I can make a contribution too, but I wanted to report the issue and get your opinion before moving forward.


Group Integration Test Not Working

As of the last couple of months, the group API appears to function differently than it used to. I am not sure that the functionality is necessarily broken for users, but the integration test appears to be broken unless account API access is available to the test user.

This story would be to examine how the test could be improved.

Clean up JobServiceTest and JobRunnerTests

Right now there is code repetition between the two classes.
The tests have also become a bit sloppy and could be cleaned up.

Goal would be to:

  1. abstract away commonalities between the two.
  2. Clean up the tests.

Bring ClusterService interface up-to-date

There are a few new methods available in Databricks' Cluster API.

Goal:
Add support for the following methods to the ClusterService

  1. pin
  2. unpin
  3. list node types
  4. list zones
  5. spark versions
  6. permanent delete

more runnable classes

It might be good to have more runnable classes with main methods that wrap service calls, or even a generic runner class that takes the service as a CLI arg.

Add mvn-checkstyle

In order to ensure that the project keeps consistent formatting, we need checkstyle as part of the build.

In terms of the checkstyle.xml used, we should determine if we should use Edmunds.com's checkstyle.xml or use a different one.

Add support for jobs API 2.1 (multitask jobs)

Currently, it is not possible to configure multiple tasks for one job. However, Databricks Jobs API 2.1 allows that, and this is a feature we would like to use when configuring jobs programmatically with the library. Is it possible to add support for this feature?
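
For reference, a Jobs API 2.1 job-settings payload with multiple tasks looks roughly like the following (an abridged sketch based on the Databricks 2.1 docs; the names, paths and cluster id are placeholders):

```json
{
  "name": "my-multitask-job",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": { "notebook_path": "/Jobs/ingest" },
      "existing_cluster_id": "1234-567890-abcde123"
    },
    {
      "task_key": "transform",
      "depends_on": [ { "task_key": "ingest" } ],
      "notebook_task": { "notebook_path": "/Jobs/transform" },
      "existing_cluster_id": "1234-567890-abcde123"
    }
  ]
}
```

Supporting this would mostly mean extending the v2.1 JobSettingsDTOv21 to model the tasks array and its per-task fields.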

Change the initScripts field in NewClusterDTO to be an array

One of our processes failed with an error: com.fasterxml.jackson.databind.JsonMappingException: Can not deserialize instance of com.edmunds.rest.databricks.DTO.InitScriptInfoDTO out of START_ARRAY token

According to the Databricks API documentation "init_scripts" is an array of InitScriptInfo. Change the initScripts field in NewClusterDTO to be InitScriptInfoDTO[] to align with Databricks API requirements.

Adopt checkstyle settings for a modern plugin version

The settings file (checkstyle/google-idea-checkstyle.xml) is incompatible with the checkstyle plugin versions available for the latest IntelliJ IDEA releases.

AC

google-idea-checkstyle.xml can be imported into modern IDEA versions

Add support for Groups API

The Databricks REST API has a Groups endpoint, allowing creation and deletion of Databricks groups. This issue would create the GroupsService class and implementation.
