Cloud-based Data Lake solution
Adanto democratizes data with a cloud-based Data Lake solution
Adanto provides easy access to raw data for the many departments of a Silicon Valley-based global leader in the Consulting and Professional Staffing industry, helping to institutionalize a data-driven digital culture throughout the entire enterprise.
Description
Data lakes are enterprise-wide data management platforms that store data from disparate sources in its native format until a client queries it for analysis. So, rather than putting the data in a purpose-built data store, it is moved into the Data Lake in its original format.
Consolidating the data eliminates information silos, which increases information use and sharing. It also lowers costs through server and license reduction, inexpensive scalability, flexibility for use with new systems, and the ability to keep the data until the data consumer – a programmer or a business user – is ready to use it.
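For illustration, a minimal sketch of this load-first, structure-later pattern using boto3; the bucket and key names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Land the export in the lake exactly as it arrived from the source system:
# no schema, no transformation; structure is applied later, at query time.
s3.upload_file(
    Filename="exports/crm_contacts_2019-06-01.csv",
    Bucket="acme-data-lake-raw",
    Key="raw/crm/contacts/2019-06-01/contacts.csv",
)
```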
Challenges:
- Poor agility and accessibility for data analysis
- Data and information silos
- Lack of information use and sharing for business decision making
- Rising costs from server and license proliferation, and growing IT complexity
- Very expensive scalability and a lack of flexibility for use with new systems
Services performed
- Data Science
- Data Analytics & Business Intelligence
- Data Warehousing
- Big Data
- Machine Learning
- Artificial Intelligence
- DevOps
- Security
- Infrastructure Services
- Salesforce
- Amazon Cloud
- Azure Cloud
Key goals
- A single store for all the raw data that anyone in any department can analyze
- A set of incremental load processes
- Data governance procedures
- Thematic, departmental, and business-line-centric data marts
- Analytic applications for various business needs
Solution
Data lakes are infrastructure components supporting systems of innovation. Systems of innovation target creating new business models, products, or services with a fail-fast mentality. However, successful innovation means making investments to scale. It is this last point, around scaling innovation, that requires a deliberate approach to designing the data lake and integrating it with your existing infrastructure to make the leap from experiments to reliable insights.
- Data stored in an inexpensive data store – an Amazon S3 bucket
- Data structured in the Parquet file format and queried through HDFS/Hive (see the first sketch after this list)
- Cloud-based Hadoop/Spark cluster set up in an AWS data center with autoscaling functionality
- Incremental load processes run on an EMR cluster in AWS and execute a daily data pull using Apache Sqoop (see the second sketch after this list)
- Unleashed the power of business intelligence at business users' fingertips
- Provided custom reporting & reporting tools
- Enabled machine learning efforts to uncover the hidden potential of the available data
- Optimized and automated business processes based on the related data
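For illustration, a minimal sketch of how data landed in S3 can be converted to Parquet and queried with SQL via Spark on an EMR cluster (where the s3:// filesystem is available); the bucket names, paths, and the column used in the query are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# Read raw CSV files landed in the lake and rewrite them as Parquet,
# the columnar format the lake uses for efficient querying.
raw = spark.read.option("header", True).csv("s3://acme-data-lake-raw/crm/contacts/")
raw.write.mode("overwrite").parquet("s3://acme-data-lake-curated/crm/contacts/")

# Query the Parquet data with plain SQL, much as a Hive external table
# defined over the same files would be queried.
contacts = spark.read.parquet("s3://acme-data-lake-curated/crm/contacts/")
contacts.createOrReplaceTempView("contacts")
spark.sql("SELECT country, COUNT(*) AS n FROM contacts GROUP BY country").show()
```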
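And a rough sketch of what one daily incremental Sqoop pull might look like, wrapped in Python for scheduling; the connection string, table, check column, and checkpoint value are hypothetical, and in a real pipeline the last-value checkpoint would be persisted between runs (for example with a saved Sqoop job):

```python
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://ats-db.internal:3306/ats",
        "--username", "etl_reader",
        "--password-file", "/user/etl/.db_password",
        "--table", "placements",
        "--incremental", "append",      # pull only rows added since the last run
        "--check-column", "placement_id",
        "--last-value", "1048576",      # checkpoint from the previous daily run
        "--as-parquetfile",             # write Parquet directly into the lake
        "--target-dir", "s3://acme-data-lake-raw/ats/placements/",
    ],
    check=True,
)
```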
Technologies used
- Data Sources/Silos:
- >60 data sources
- >200 GB of new data per day
- One logical data store (data physically kept in different AWS cloud-based data stores based on data type)
- Amazon S3
- Amazon EC2 (Elastic Compute Cloud service for secure, scalable compute capacity)
- Amazon Redshift (data warehouse for standard SQL queries & BI tools; see the sketch after this list)
- Amazon RDS (relational database service supporting several engines: PostgreSQL, MySQL, Oracle, Microsoft SQL Server)
- Apache Sqoop (open-source tool for bulk data transfers)
- HDFS with Parquet on Amazon EMR (Elastic MapReduce Hadoop cluster)
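Because Redshift speaks the PostgreSQL wire protocol, any standard SQL client can query it; a minimal sketch with psycopg2, using a hypothetical endpoint, credentials, and table:

```python
import psycopg2  # Redshift is queried over the PostgreSQL wire protocol

conn = psycopg2.connect(
    host="acme-dw.abc123.us-west-2.redshift.amazonaws.com",  # hypothetical endpoint
    port=5439,
    dbname="analytics",
    user="bi_reader",
    password="...",  # placeholder; use IAM or a secrets manager in practice
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT department, SUM(revenue) FROM placements GROUP BY department")
    for department, revenue in cur.fetchall():
        print(department, revenue)
```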
- Query Tools & Analytics
- Apache Hive, Pig, Spark (open-source query interfaces to HDFS and data processing engines)
- R (open-source statistical programming language for data mining and statistical computing)
- Mahout/scikit-learn (open-source tools for building machine learning applications; see the sketch after this list)
- Pentaho, QlikView, Power BI, SAS (data analytics, business intelligence, and reporting tools)
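As a rough sketch of how curated Parquet data from the lake might feed the machine-learning tooling above (reading Parquet from S3 with pandas requires pyarrow and s3fs; the path, feature columns, and label are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Pull a curated Parquet dataset straight from the lake into a DataFrame.
df = pd.read_parquet("s3://acme-data-lake-curated/ats/placements/")

X = df[["days_open", "bill_rate", "submittals"]]  # hypothetical feature columns
y = df["filled"]                                  # hypothetical binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```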
"Adanto has helped us in our first phase of creating DataLake and gathering data in centralised location”
Sean Perry, CIO