Course Duration: 5 Days

Data Science and Big Data Analytics

This course builds on skills developed in the Data Science and Big Data Analytics course. The main focus areas cover Hadoop (including Pig, Hive, and HBase), Natural Language Processing, Social Network Analysis, Simulation, Random Forests, Multinomial Logistic Regression, and Data Visualization. Taking an “Open” or technology-neutral approach, this course utilizes several open-source tools to address big data challenges.

Upon successful completion of this course, participants should be able to:

• Immediately participate and contribute as a Data Science Team Member on big data and other analytics projects by:

o Deploying the Data Analytics Lifecycle to address big data analytics projects

o Reframing a business challenge as an analytics challenge

o Applying appropriate analytic techniques and tools to analyze big data, create statistical models, and identify insights that can lead to actionable results

o Selecting appropriate data visualizations to clearly communicate analytic insights to business sponsors and analytic audiences

o Using tools such as R and RStudio, MapReduce/Hadoop, in-database analytics, and window and MADlib functions

• Explain how advanced analytics can be leveraged to create competitive advantage and how the data scientist role and skills differ from those of a traditional business intelligence analyst

This course is intended for individuals seeking to develop an understanding of Data Science from the perspective of a practicing Data Scientist, including:

  •   Managers of teams of business intelligence, analytics, and big data professionals
  •   Current Business and Data Analysts looking to add big data analytics to their skills
  •   Data and database professionals looking to exploit their analytic skills in a big data environment
  •   Recent college graduates and graduate students with academic experience in a related discipline looking to move into the world of Data Science and big data
  •   Individuals seeking to take advantage of the EMC Proven™ Professional Data Scientist Associate (EMCDSA) certification

The following modules and lessons included in this course are designed to support the course objectives:

  •   Introduction and Course Agenda
  •   Introduction to Big Data Analytics
    • ▬  Big Data Overview
    • ▬  State of the Practice in Analytics
    • ▬  The Data Scientist
    • ▬  Big Data Analytics in Industry Verticals
  •   Data Analytics Lifecycle
    • ▬  Discovery
    • ▬  Data Preparation
    • ▬  Model Planning
    • ▬  Model Building
    • ▬  Communicating Results
    • ▬  Operationalizing
  •   Review of Basic Data Analytic Methods Using R
    • ▬  Using R to Look at Data – Introduction to R
    • ▬  Analyzing and Exploring the Data
    • ▬  Statistics for Model Building and Evaluation
  •   Advanced Analytics – Theory and Methods
    • ▬  K-means Clustering
    • ▬  Association Rules
    • ▬  Linear Regression
    • ▬  Logistic Regression
    • ▬  Naïve Bayesian Classifier
    • ▬  Decision Trees
    • ▬  Time Series Analysis
    • ▬  Text Analysis
  •   Advanced Analytics – Technologies and Tools
    • ▬  Analytics for Unstructured Data – MapReduce and Hadoop
    • ▬  The Hadoop Ecosystem
    • ▬  In-database Analytics – SQL Essentials
    • ▬  Advanced SQL and MADlib for In-database Analytics
  •   The Endgame, or Putting it All Together
    • ▬  Operationalizing an Analytics Project
    • ▬  Creating the Final Deliverables
    • ▬  Data Visualization Techniques
    • ▬  Final Lab Exercise on Big Data Analytics