
HimanshuSinhamfst-5269 asked KranthiPakala-MSFT edited

Data Lake and Environments - Best Practice

Hello All,

Is it a best practice to have one big data lake for all environments (Dev, Stage, QA, and Prod), or to have one data lake for Prod and another for non-Prod, and so on?

If we choose to share a data lake across environments, then auditing will play a major role. It would really help if others could share their experience and guidance.

Thanks

[Note: As we migrate from MSDN, this question has been posted by an Azure Cloud Engineer as a frequently asked question]

MSDN Source: DataLake and Environments - Best Practice




azure-data-lake-storage

1 Answer

KranthiPakala-MSFT answered KranthiPakala-MSFT edited

Welcome to the Microsoft Q&A (Preview) platform.

Happy to answer your query.

You may check out “FAQs about organizing a Data Lake”, which addresses your query.

If I need a separate dev, test, prod environment, how would this usually be handled?

Usually separate environments are handled with separate services. For instance, in Azure, that would be 3 separate Azure Data Lake Storage resources (which might be in the same subscription or different subscriptions).

We wouldn’t usually separate out dev/test/prod with a folder structure in the same data lake. It can be done (just like you could use the same database with a different schema for dev/test/prod) but it’s not the typical recommended way of handling the separation. We prefer having the exact same folder structure across all 3 environments. If you must get by with it being within one data lake (one service), then the environment should be the top-level node.
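To make the two layouts concrete, here is a minimal sketch in Python. The account names (`contosolakedev`, `contosolake`, and so on), the container name, and both helper functions are hypothetical and only illustrate the idea; the URI shape is the standard `abfss://<container>@<account>.dfs.core.windows.net/<path>` form used with ADLS Gen2.

```python
# Hypothetical account names: one ADLS Gen2 account per environment.
ACCOUNTS = {
    "dev": "contosolakedev",
    "test": "contosolaketest",
    "prod": "contosolakeprod",
}


def separate_lake_uri(env: str, container: str, path: str) -> str:
    """Preferred layout: identical folder structure, only the account differs."""
    account = ACCOUNTS[env]  # raises KeyError for an unknown environment
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.strip('/')}"


def shared_lake_uri(env: str, container: str, path: str,
                    account: str = "contosolake") -> str:
    """Fallback layout: one account, with the environment as the top-level folder."""
    if env not in ACCOUNTS:
        raise KeyError(env)
    return f"abfss://{container}@{account}.dfs.core.windows.net/{env}/{path.strip('/')}"


print(separate_lake_uri("dev", "data", "raw/sales/2020"))
print(shared_lake_uri("prod", "data", "raw/sales/2020"))
```

With separate accounts, a pipeline only needs the environment name swapped in its configuration and every path below the container stays the same, which is the main reason for keeping the folder structure identical.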

Regarding monitoring in ADLS Gen2:

Azure Data Lake Storage Gen2 provides metrics in the Azure portal under the Data Lake Storage Gen2 account and in Azure Monitor. Availability of Data Lake Storage Gen2 is displayed in the Azure portal. To get the most up-to-date availability of a Data Lake Storage Gen2 account, you must run your own synthetic tests to validate availability. Other metrics such as total storage utilization, read/write requests, and ingress/egress are available to be leveraged by monitoring applications and can also trigger alerts when thresholds (for example, Average latency or # of errors per minute) are exceeded.
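A synthetic availability test can be as simple as repeating a small operation against the account and recording how often it succeeds. The sketch below is an assumption about how one might structure such a test, not an Azure API: `measure_availability` is a hypothetical helper, and `probe` stands in for whatever small read or write you choose to run against your own Data Lake Storage Gen2 account.

```python
import time
from typing import Callable


def measure_availability(probe: Callable[[], None],
                         attempts: int = 5,
                         pause_seconds: float = 0.0) -> float:
    """Run `probe` several times and return the fraction of successful runs.

    `probe` is a hypothetical callable that performs one small operation
    against the lake (for example, reading a tiny test file) and raises
    an exception on failure.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    successes = 0
    for i in range(attempts):
        try:
            probe()
            successes += 1
        except Exception:
            # Any failure counts against availability.
            pass
        if pause_seconds and i < attempts - 1:
            time.sleep(pause_seconds)
    return successes / attempts
```

Running this on a schedule per environment, and alerting when the returned fraction drops below a threshold you pick, complements the built-in metrics rather than replacing them.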

For more details, refer to “Best practices for using Azure Data Lake Storage Gen2”.

Hope this helps. Do let us know if you have any further queries.

