
OVHCloud Data Processing: Real ‘Spark as a service’


OVHCloud knows Novagen and its commitment to innovation well. When OVHCloud invited Novagen to be among the early testers of its new Data Processing product, built on top of Apache Spark as a service, Novagen felt “honoured and eager to test it.”

When selecting a new technology for its data activities, Novagen wants the following criteria to be fully addressed:

  1. Ability to foster Innovation & Creativity,
    • Functionalities, additional value, ease of use,
  2. Efficiency, Cost-effectiveness,
    • Intrinsic performance,
    • Adaptive architectures that allow the infrastructure to be adjusted to customer needs,
  3. Standards and Governance,
    • Customers adopt cloud or multi-cloud strategies. Relying on standards limits the effort needed to deploy on different targets and preserves reversibility,
  4. Compliance:
    • Most companies hold sensitive data and must know which regulatory rules their cloud provider has to follow.

As a data consultancy company, Novagen is an extensive Apache Spark user!

Apache Spark is the Swiss Army knife of data processing:

  • Works at extremely large data scale,
  • Addresses both Data Engineering and Data Science,
  • Processes data at rest as well as streaming data,
  • De facto standard for data workloads on-premises and in the cloud,
  • Built-in APIs for Python, Scala, Java and R

Novagen has progressively developed software assets on top of Apache Spark to address recurring challenges :

  • ETL processing in Data Lake environments,
  • Quality KPIs on top of Data Lake sources,
  • Machine Learning algorithms for Natural Language Processing, Time Series predictions…

First step : select the best Novagen Use Case

Novagen first considered the following characteristics of OVHCloud Data Processing:

⇒ Processing engine built on top of Apache Spark 2.4.3
⇒ Jobs start after a few seconds (vs minutes to launch a cluster)
⇒ Ability to adjust the power dedicated to different Spark jobs: from low power (1 driver and 1 executor with 4 cores and 8 GB of memory) up to high-scale processing (potentially hundreds of cores and GB of memory)
⇒ A full Compute/Storage separation in line with standard cloud architectures, including S3 APIs to access data stored in the Object Storage layer
⇒ Job execution and monitoring through the Command Line Interface and the API

These characteristics led Novagen to choose its Quality Assessment Process as an ideal use case, since it requires both interactivity and adjustable power: delivering quality KPIs through Spark processes.
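The article does not say which KPIs the quality software computes. Purely as an illustration, here is a minimal plain-Python sketch of one common quality KPI, completeness (the share of records with a non-null value in each column). The function name, column names and sample records are all hypothetical; in a real Spark job the same aggregation would be expressed over a distributed dataset rather than a Python list.

```python
def completeness(records, columns):
    """Return, for each column, the share of records holding a non-null value."""
    total = len(records)
    if total == 0:
        # No records: report 0.0 rather than dividing by zero.
        return {column: 0.0 for column in columns}
    return {
        column: sum(1 for record in records if record.get(column) is not None) / total
        for column in columns
    }


# Hypothetical sample: one email is null and one record has no email key at all.
records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
    {"customer_id": 4},
]

kpis = completeness(records, ["customer_id", "email"])
print(kpis)
```

A KPI of this kind is cheap on a small sample and expensive on a full Data Lake source, which is why being able to rerun the same job with more or fewer executors is useful.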

Second step : OVHCloud Data Processing at work

The corresponding command generated by the quality software is :

./ovh-spark-submit --projectid ec7d2cb6da084055a0501b2d8d8d62a1 --class tech.novagen.spark.Launcher --driver-cores 4 --driver-memory 8G --executor-cores 4 --executor-memory 8G --num-executors 5 swift://sparkjars/QualitySparkExecutor-1.0-spark.jar --apiServer=

The command is quite similar to a usual spark-submit, except for the jar path: the binary must be stored in an Object Storage bucket and is referenced with a swift:// URL. (NB: this command could also have been created with a call to the OVHCloud Data Processing API.)
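How the quality software generates this command is not described here. As a rough sketch only, a generator could assemble the argument list from a few sizing parameters. The helper `build_submit_command` and its parameter names are hypothetical; the flags are the ones that appear in the command above, written with the double hyphen that spark-submit style tools expect. Application arguments are left empty because the value passed to `apiServer` is not given.

```python
import shlex


def build_submit_command(project_id, main_class, jar_url, driver, executor,
                         num_executors, app_args=()):
    """Assemble an ovh-spark-submit invocation as a list of arguments."""
    # The jar has to live in Object Storage, so it is addressed by a swift:// URL.
    if not jar_url.startswith("swift://"):
        raise ValueError("the jar must be referenced with a swift:// URL")
    return [
        "./ovh-spark-submit",
        "--projectid", project_id,
        "--class", main_class,
        "--driver-cores", str(driver["cores"]),
        "--driver-memory", driver["memory"],
        "--executor-cores", str(executor["cores"]),
        "--executor-memory", executor["memory"],
        "--num-executors", str(num_executors),
        jar_url,
        *app_args,  # arguments forwarded to the Spark application itself
    ]


command = build_submit_command(
    project_id="ec7d2cb6da084055a0501b2d8d8d62a1",
    main_class="tech.novagen.spark.Launcher",
    jar_url="swift://sparkjars/QualitySparkExecutor-1.0-spark.jar",
    driver={"cores": 4, "memory": "8G"},
    executor={"cores": 4, "memory": "8G"},
    num_executors=5,
)

# shlex.join renders the list as a single shell-safe command line.
print(shlex.join(command))
```

Keeping the command as a list until the last moment avoids quoting mistakes if a class name or application argument ever contains spaces.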
From this point, Novagen can fine-tune its portfolio of processes and allocate different amounts of power to each job with few limitations (other than the quotas of your Public Cloud project).
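To make the sizing concrete, the job submitted above requests 1 driver plus 5 executors, each with 4 cores and 8G of memory. The short sketch below adds this up and compares it with a project quota. The quota figures are invented for the example, and the calculation ignores any memory overhead the platform may add on top of the requested values.

```python
def parse_gigabytes(memory):
    """Convert a Spark-style memory string such as '8G' into gigabytes."""
    if not memory.upper().endswith("G"):
        raise ValueError(f"only gigabyte sizes are handled in this sketch: {memory!r}")
    return int(memory[:-1])


def total_resources(driver_cores, driver_memory, executor_cores,
                    executor_memory, num_executors):
    """Return (total cores, total memory in GB) requested by one job."""
    cores = driver_cores + executor_cores * num_executors
    memory_gb = (parse_gigabytes(driver_memory)
                 + parse_gigabytes(executor_memory) * num_executors)
    return cores, memory_gb


def fits_quota(cores, memory_gb, quota_cores, quota_memory_gb):
    """True when the request stays within the (hypothetical) project quota."""
    return cores <= quota_cores and memory_gb <= quota_memory_gb


# Values taken from the command above: 4 + 4 * 5 cores, 8 + 8 * 5 GB.
cores, memory_gb = total_resources(4, "8G", 4, "8G", 5)
print(cores, memory_gb)  # 24 48

# Hypothetical quota of 64 cores and 128 GB, chosen only for illustration.
print(fits_quota(cores, memory_gb, quota_cores=64, quota_memory_gb=128))
```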

Finally, for tuning and post-mortem job analysis, one can take advantage of the saved log files. It is noteworthy that Data Processing also offers a real-time display of job logs, which is very convenient, along with complementary supervision through Grafana dashboards.

This is a first yet significant test of Data Processing. So far it has proved an excellent match for the Novagen quality process use case and has allowed Novagen to validate several crucial points that matter when testing a data solution.

“This is the beginning of this product, and we will have a close look at the upcoming functionalities. OVHCloud team unveiled part of its roadmap, and it looks really promising.”

Novagen : We are Data innovators !!!

⇒ As a consultancy company, we build complete and innovative data strategies for our demanding customers:

  • Top Fortune Bank, Regulatory, Retail, Fashion, Transportation,
  • BI at extreme scale, Data Lake creation and management, Business innovation with Data Science

⇒ With our Data Lab, we are continuously improving our technology portfolio:

  • Selecting, assessing, benchmarking solutions,
  • Developing ‘boosters’ : ready to deploy or customized data assets.

We periodically communicate about our innovations. For instance :

  • How to leverage Kubernetes to empower your multi-cloud strategy,
  • Apache Kylin, Apache Druid : technologies for ultra high scale Business Intelligence
  • Data science automation, from notebooks to operational Machine Learning Models

Technology in the service of business needs.

Experimentation, Method and Industrialisation.
