Developing a Robust Data Platform: Key Considerations


Developing a robust data platform definitely requires more than HDFS, Hive, Sqoop and Pig. Today there is a real need to bring data and compute as close together as possible. More and more requirements are forcing us to deal with high-throughput, low-latency scenarios. Thanks to in-memory solutions, this now seems achievable.

One of the lessons I have learnt in the last few years is that it is hard to resist building your own technology infrastructure while developing a platform. It is always important to remind ourselves that we are here to build solutions and not technology infrastructure.

Some of the key questions to consider when embarking on such a journey are:

  1. How do we handle the ever growing volume of data (Data Repository)?
  2. How do we deal with the growing variety of data (Polyglot Persistence)?
  3. How do we ingest large volumes of data as we start growing (Ingestion Pipelines/Write Efficient)?
  4. How do we scale in terms of faster data retrieval so that the Analytics engine can provide something meaningful at a decent pace?
  5. How do we deal with the need for Interactive Analytics with a large dataset?
  6. How do we keep our cost per terabyte low while taking care of our platform growth?
  7. How do we move data securely between on-premise infrastructure and cloud infrastructure?
  8. How do we handle data governance, data lineage, data quality?
  9. What kind of monitoring infrastructure would be required to support distributed processing?
  10. How do we model metadata so that we can address domain specific problems?
  11. How do we test this infrastructure? What kind of automation is required?
  12. How do we create a service delivery platform for build and deployment?

One of the challenges I am seeing right now is the urge to use multiple technologies to solve similar problems. Though this gives my developers the freedom to do things differently and more efficiently, from a platform perspective it increases the total cost of operations. It also raises a few operational questions:

  1. How do we support our customers in production?
  2. How can we make life better for our operations teams?
  3. How do we take care of the reliability, durability, scalability, extensibility and maintainability of this platform?

I will talk about the data repository and the possible choices in the next post.

Happy Learning!


SaaS: CapEx, OpEx…

If you are dealing with anything related to SaaS or cloud computing, you must have heard these terms very frequently. I wanted to understand them better and found useful information on Wikipedia:

Capital expenditures (CAPEX) are expenditures creating future benefits. A capital expenditure is incurred when a business spends money either to buy fixed assets or to add to the value of an existing fixed asset with a useful life that extends beyond the taxable year. Capex are used by a company to acquire or upgrade physical assets such as equipment, property, or industrial buildings.

An Operating expense, operating expenditure, operational expense, operational expenditure or OPEX is an on-going cost for running a product, business, or system. Its counterpart, a capital expenditure (CAPEX), is the cost of developing or providing non-consumable parts for the product or system. For example, the purchase of a photocopier is the CAPEX, and the annual paper and toner cost is the OPEX. For larger systems like businesses, OPEX may also include the cost of workers and facility expenses such as rent and utilities.
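To make the photocopier example concrete, here is a minimal sketch that compares buying a copier (CapEx up front plus yearly OpEx) with a pay-as-you-go service (OpEx only). All figures and the function names are hypothetical and chosen only for illustration; they do not come from Wikipedia or from any real price list.

```python
def cumulative_cost(capex, annual_opex, years):
    """Total spend after `years`: one-off capital outlay plus recurring operating cost."""
    return capex + annual_opex * years


# Hypothetical figures, for illustration only.
buy = {"capex": 5000, "annual_opex": 600}   # purchase the copier, then pay for paper and toner
rent = {"capex": 0, "annual_opex": 2000}    # pay-as-you-go service, no upfront purchase


def break_even_year(buy, rent, horizon=10):
    """First year in which buying is no more expensive than renting, or None."""
    for year in range(1, horizon + 1):
        if cumulative_cost(**buy, years=year) <= cumulative_cost(**rent, years=year):
            return year
    return None


if __name__ == "__main__":
    for year in range(1, 6):
        print(year, cumulative_cost(**buy, years=year), cumulative_cost(**rent, years=year))
    print("break-even year:", break_even_year(buy, rent))
```

With these made-up numbers the OpEx-only option is cheaper for the first three years and buying catches up in the fourth, which is the kind of trade-off the CapEx vs OpEx discussion is about.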

Some useful links
CAPEX/OPEX from Project’s Manager point of view
CAPEX vs OPEX: What is the difference?
SaaS decisions: Cap Ex vs Op Ex
Accounting for Clouds: Stop Saying CapEx Vs. OpEx

Is Multi-Tenancy a prerequisite for SaaS?

I recently attended a conference on cloud computing and one of the speakers said, if your application is not multi-tenant, then your application is not SaaS.

Let us look at the characteristics of a SaaS system:

1. Availability via Web Browser
2. On-demand availability
3. Pay-per-use pricing
4. Minimal or zero IT demands

Now let us look at what a multi-tenant application is all about. It is a model in which multiple clients are supported by a single software instance. This helps the SaaS provider support more clients on fewer hardware components, and it makes rollouts and updates easier.
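As a rough illustration of that idea, the sketch below shows one application instance serving several tenants while keeping their data apart by a tenant identifier. The class name, method names and tenant ids are hypothetical; a real SaaS product would put this separation in its database and access layer rather than in an in-memory dictionary.

```python
from collections import defaultdict


class MultiTenantApp:
    """One running instance that serves several tenants and keeps their data apart."""

    def __init__(self):
        # tenant_id -> list of records belonging to that tenant
        self._records = defaultdict(list)

    def add_record(self, tenant_id, record):
        self._records[tenant_id].append(record)

    def list_records(self, tenant_id):
        # Every read is scoped by tenant_id, so one tenant never sees another's data.
        return list(self._records.get(tenant_id, []))


# A single instance shared by two hypothetical tenants.
app = MultiTenantApp()
app.add_record("acme", "invoice-1")
app.add_record("globex", "invoice-7")
app.add_record("acme", "invoice-2")
```

The alternative, one isolated instance per client, would mean creating a separate `MultiTenantApp()` for each tenant, which is simpler to reason about but needs more hardware and more rollouts.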

Read this post on Multi-Tenant Architecture from MSDN to learn more about multi-tenancy.

My point is that it depends on the service offerings and the customizations required, and on the way you manage your deployments. I do not dispute that multi-tenancy may give the SaaS provider some cost benefits, which may translate into lower service fees for end users. But the decision should be based on what is really needed.

I searched the web and found a post along similar lines:

If you buy SaaS, don’t get lured into multi-tenancy marketo-munjo-jumbo and concentrate on features, SLA, integration options and cost.

If you are a service provider, then, yes, multi-tenancy is a (potentially very important) internal secret sauce that you can use to augment your economy of scale (at the expense of other aspects) but it is by no means a prerequisite, the right trade-off between multi-tenancy and isolation will depend on a myriad of factors and is often unique to the situation. As mentioned in this blog in the past and in Phil’s post today, virtualization can be a very successful way of achieving interesting levels of economy of scale without architecting the application for full multi-tenancy.

A high level model that served me well in the past in helping company understand whether they should go multi-tenant or not is the “cost per feature” vs. “cost per tenant” model.


Some useful links
Cloud Application Architectures: Building Applications and Infrastructure in the Cloud

Happy Learning!!!