Databricks Cluster Size

james vasanth 0 Reputation points
2024-04-16T14:55:11.34+00:00

Hi,

We have the below use case.

We are developing an ML solution using the Azure Databricks cluster service. Our Databricks workspace receives a dataset of about 20 GB of records, and we run our logic/algorithms against this dataset.

We have to run these ML algorithms against the 20 GB dataset once every day. I'm not able to identify the required cluster size.

While creating compute in Databricks, I checked the available ML-based GPU-supported clusters, but they cost more.

Please let me know the best compute size to perform these operations.

Thanks.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

4 answers

  1. Luis Arias 4,721 Reputation points
    2024-04-16T18:00:32.3266667+00:00

    Hello james vasanth,

    There are several factors and best practices to consider. Here are some key points:

    • User Type: The type of user (data scientist, data engineer, data analyst) can influence the compute configuration.
    • Workload Type: Different workloads (ETL jobs, analytical workloads) have different requirements.
    • Service Level Agreement (SLA): The level of SLA you need to meet can influence your configuration.
    • Budget Constraints: Your budget can determine the size and type of cluster you choose.
    • Compute Features: Understanding the features of Databricks compute (all-purpose compute, job compute, single-node, multi-node, on-demand, spot instances) can help you make an informed decision.
    • Concurrent Queries: Databricks recommends a cluster for every 10 concurrent queries.
    • Data Size: The size of your data can also influence the size of your cluster.

    Your configuration decisions will involve a tradeoff between cost and performance. It's recommended to start with a smaller cluster, monitor its performance, and then adjust the cluster size as needed based on your observations (see the sketch below).
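    As an illustration, here is a minimal sketch of a job-cluster payload for the Databricks Clusters REST API (`POST /api/2.0/clusters/create`). The node type, runtime label, and autoscale bounds are assumptions for illustration, not recommendations for your exact workload:

    ```python
    import os
    import requests

    # Assumption: workspace URL and a personal access token are supplied
    # via environment variables.
    host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-1234.5.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]

    # Start small and let autoscaling grow the cluster only when the daily
    # 20 GB job actually needs it. All values here are illustrative.
    cluster_spec = {
        "cluster_name": "daily-ml-job",
        "spark_version": "14.3.x-cpu-ml-scala2.12",  # an ML runtime label; pick one your workspace lists
        "node_type_id": "Standard_DS3_v2",           # 4 cores / 14 GB RAM; a common low-cost starting point
        "autoscale": {"min_workers": 2, "max_workers": 4},
        "autotermination_minutes": 30,               # shut down idle compute to control cost
    }

    resp = requests.post(
        f"{host}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {token}"},
        json=cluster_spec,
    )
    resp.raise_for_status()
    print("cluster_id:", resp.json()["cluster_id"])
    ```

    Since the workload runs only once a day, submitting it as a scheduled Job on a job cluster (rather than keeping an all-purpose cluster running) is usually the cheaper option.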

    If the information helped address your question, please Accept the answer.

    Luis

    1 person found this answer helpful.

  2. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.



  3. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.



  4. PRADEEPCHEEKATLA-MSFT 77,336 Reputation points Microsoft Employee
    2024-04-17T04:29:54.8133333+00:00

    @james vasanth - Thanks for the question and using MS Q&A platform.

    To determine the recommended cluster size for your use case, you need to consider the size of your dataset, the complexity of your algorithms, and the desired performance of your job.

    In your case, you have mentioned that you are working with a 20 GB dataset and running ML algorithms against it once a day. Based on this information, you can start with a smaller cluster size and scale up if needed.

    For example, you can start with a cluster of 2-4 worker nodes and 8 GB of memory per node (a rough sizing check follows below). Monitor the job's performance and resource utilization, and scale the cluster up if needed.

    [Screenshot: Databricks compute creation page showing a small multi-node cluster configuration]
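    As a rough back-of-the-envelope check of why a small cluster is a reasonable starting point, assuming Spark's common ~128 MB input partition size and a hypothetical 4-core worker SKU:

    ```python
    # Back-of-the-envelope sizing for the daily 20 GB job.
    # Assumptions (illustrative): ~128 MB per input partition (a common
    # default for file-based reads) and 4 cores per worker node.
    dataset_gb = 20
    partition_mb = 128
    cores_per_worker = 4

    num_partitions = dataset_gb * 1024 // partition_mb   # ~160 tasks
    for workers in (2, 4):
        total_cores = workers * cores_per_worker
        waves = -(-num_partitions // total_cores)        # ceiling division
        print(f"{workers} workers -> {total_cores} cores, ~{waves} task waves")

    # 2 workers ->  8 cores, ~20 task waves
    # 4 workers -> 16 cores, ~10 task waves
    ```

    For a once-a-day batch job, 10-20 task waves is usually acceptable; scale up only if the measured runtime misses your SLA.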

    For your workload, if you require GPU support, you can consider using a GPU-enabled VM SKU such as the NCasT4_v3-series as shown below. These VM SKUs are optimized for GPU workloads and provide a good balance of compute and memory resources.

    [Screenshot: worker type selection listing NCasT4_v3-series GPU VM SKUs]

    It's also important to note that GPU-based clusters can provide significant performance improvements for certain types of ML workloads, but they can be more expensive. If you don't require GPU support, you can consider using a CPU-based cluster to reduce costs.
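    To make the comparison concrete: relative to the earlier (illustrative) CPU spec, only the node type and runtime label change for a GPU cluster. The SKU and version strings below are assumptions to verify against what your workspace actually offers:

    ```python
    # GPU variant of the earlier illustrative spec. Verify the exact node
    # type and runtime label in your workspace, e.g. via
    # GET /api/2.0/clusters/list-node-types and /api/2.0/clusters/spark-versions.
    gpu_cluster_spec = {
        "cluster_name": "daily-ml-job-gpu",
        "spark_version": "14.3.x-gpu-ml-scala2.12",  # GPU ML runtime label (illustrative)
        "node_type_id": "Standard_NC4as_T4_v3",      # NCasT4_v3-series: 4 vCPUs + 1 NVIDIA T4
        "autoscale": {"min_workers": 1, "max_workers": 2},
        "autotermination_minutes": 30,
    }
    ```

    A simple way to decide is to time one daily run on each cluster type and compare wall-clock time multiplied by the hourly price; classic ML libraries (e.g. scikit-learn) often run cheaper on CPU, while deep-learning training tends to benefit from the GPU.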

    Overall, the best compute size for your operations will depend on your specific requirements and workload characteristics.

    For more details, refer to Compute configuration best practices and Create a cluster using Databricks Runtime ML.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes if the answer was helpful. And if you have any further queries, do let us know.
