Getting Started in Big Data
Most of the skills required of a Big Data developer or analyst revolve around the Apache Hadoop framework. A background in data is definitely a good start; however, individuals with a background in ETL (Extract, Transform & Load), data warehousing, and business intelligence/analytics will have a much shorter learning curve. Below are links and helpful information on resources and certifications. The recommendations below are based on the minimum skills and qualifications required. The numbers in the circles denote the sequence in which to follow the learning path. Python and Java programming skills, though good to have, are not critical to landing a job, and are denoted by dotted lines. Scala programming, on the other hand, is an absolute must for Apache Spark.
7 Required Skills for a Big Data Developer/Analyst in the Federal Sector
- Hadoop HDFS Architecture: Contains the Apache Hadoop architecture documentation for the Hadoop Distributed File System (HDFS). The main areas to focus on are the NameNode, DataNodes, and the file system namespace
- HDFS File System Commands: Contains all the commands required to interact with HDFS from a shell interface. These commands are very similar to standard UNIX/Linux commands, and mastery of them is critical to being a successful developer
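As a taste of what these look like, here are a few of the most commonly used HDFS commands (they require a running Hadoop installation, and the paths shown are purely illustrative):

```
# List the contents of a directory in HDFS
hdfs dfs -ls /user/jdoe

# Create a directory and copy a local file into it
hdfs dfs -mkdir -p /user/jdoe/data
hdfs dfs -put sales.csv /user/jdoe/data/

# Print a file stored in HDFS to the terminal
hdfs dfs -cat /user/jdoe/data/sales.csv

# Copy a file from HDFS back to the local file system
hdfs dfs -get /user/jdoe/data/sales.csv ./sales_copy.csv
```

Notice how closely `-ls`, `-mkdir`, `-put`, `-cat`, and `-get` mirror their UNIX/Linux counterparts, which is why a UNIX background shortens this part of the learning curve.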
- Apache Sqoop: Used to copy data from a traditional relational database (RDBMS) into the Hadoop Distributed File System (HDFS) and back. Sqoop is relatively easy to master
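A typical Sqoop workflow looks like the sketch below, an import into HDFS followed by an export back out. The database host, credentials, and table names here are hypothetical placeholders, and a running Hadoop cluster plus a reachable database are assumed:

```
# Import a table from a relational database into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales_db \
  --username analyst --password-file /user/jdoe/.dbpass \
  --table orders \
  --target-dir /user/jdoe/data/orders \
  --num-mappers 4

# Export processed results from HDFS back into the database
sqoop export \
  --connect jdbc:mysql://dbhost:3306/sales_db \
  --username analyst --password-file /user/jdoe/.dbpass \
  --table order_summary \
  --export-dir /user/jdoe/output/order_summary
```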
- Scala Programming: Contains links to the official Scala documentation. Scala programming skills are a must before diving into Apache Spark. Even though Python and Java programming skills are listed as recommended, Scala will suffice for the most part
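To give a feel for the language, the short sketch below shows a few core Scala idioms — immutable values, case classes, and higher-order collection methods — that carry over directly into Spark code (the data is made up for illustration):

```scala
// A case class gives an immutable record type with equality for free
case class Employee(name: String, salary: Double)

object ScalaBasics {
  def main(args: Array[String]): Unit = {
    val staff = List(
      Employee("Ada", 105000),
      Employee("Grace", 98000),
      Employee("Alan", 87000)
    )

    // Higher-order functions: filter, map, and sum over an immutable list
    val highEarners = staff.filter(_.salary > 90000).map(_.name)
    val payroll     = staff.map(_.salary).sum

    println(s"High earners: ${highEarners.mkString(", ")}")
    println(s"Total payroll: $payroll")
  }
}
```

The same `filter`/`map` style is exactly how you will manipulate distributed data sets in Spark, which is why Scala fluency pays off so quickly.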
- Apache Spark: In its simplest form, Spark provides a processing engine capable of crunching large data sets in memory or on disk at speeds far greater than MapReduce. Mastery of Apache Spark is an absolute must to be a Big Data developer
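The classic first Spark program is a word count. The sketch below assumes a Spark 2.x-or-later installation and an illustrative input path in HDFS; it can be pasted into `spark-shell` or packaged and run with `spark-submit`:

```scala
import org.apache.spark.sql.SparkSession

// Minimal Spark job in Scala: count word occurrences in a text file.
// The HDFS path is illustrative.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("WordCount")
      .getOrCreate()

    val counts = spark.sparkContext
      .textFile("hdfs:///user/jdoe/data/notes.txt")
      .flatMap(_.split("\\s+"))   // split lines into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word across the cluster

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```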
- Apache Hive: Hive is Hadoop's version of a data warehouse application, with capabilities to aggregate, analyze, and query large data sets. HiveQL is very similar to traditional SQL and has a much shallower learning curve.
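To show just how SQL-like HiveQL is, here is a hedged sketch that defines an external table over CSV data already sitting in HDFS and then aggregates it; the table name, columns, and path are all illustrative:

```sql
-- Define an external Hive table over CSV data in HDFS
CREATE EXTERNAL TABLE orders (
  order_id   INT,
  region     STRING,
  amount     DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/jdoe/data/orders';

-- Aggregate with familiar SQL syntax
SELECT region, COUNT(*) AS num_orders, SUM(amount) AS total
FROM orders
GROUP BY region
ORDER BY total DESC;
```

Anyone comfortable with SQL will read this immediately, which is what makes Hive the gentlest entry point on this list.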
- Apache HBase: Hadoop's version of a non-relational, column-oriented database, which provides real-time, random read/write access.
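A quick illustration of that random read/write access, using the interactive HBase shell against a running HBase instance (table and row names are made up):

```
# From the HBase shell: create a table with one column family
create 'employees', 'info'

# Random writes: each cell is addressed by row key, column, and value
put 'employees', 'emp001', 'info:name', 'Ada'
put 'employees', 'emp001', 'info:salary', '105000'

# Random reads: fetch a single row, or scan the whole table
get 'employees', 'emp001'
scan 'employees'
```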
What Software Do I Need?
- Cloudera QuickStart VM: The quickest and cleanest way to start learning without having to download, install and configure all the above Hadoop components separately.
How to Get Certified
- CCA Spark and Hadoop Developer: Cloudera Certified Associate (CCA) is a valuable and recognized certification for Developers
- CCA Data Analyst: This is a highly recommended and recognized certification by Cloudera for those interested in the Data Analyst path
* For more detailed guidance or learning path information for our employees, email us at email@example.com or visit our archives
Getting Started in Machine Learning/Deep Learning
Before diving into Machine Learning/Deep Learning, having a good grasp of Big Data skills — specifically Hadoop HDFS, HBase, Hive, and Spark — is highly recommended. Python, Scala, and R programming are among the prerequisite skills for Machine Learning (ML); however, Python will suffice to start with. Machine Learning (ML) involves heavy usage of algebra and statistics concepts, so keep things simple: start with Python and add R programming once you have mastered the ML concepts.
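To make the "algebra and statistics" point concrete, here is a minimal sketch in pure Python of the math behind one of the first ML techniques you will meet: ordinary least squares for a single feature, fit from its closed-form solution (the data points are invented for illustration):

```python
# Fit y ≈ m*x + b by ordinary least squares, using the closed-form
# solution: slope = covariance(x, y) / variance(x).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - m * mean_x
    return m, b

# Points lying exactly on y = 2x + 1, so the fit recovers m=2, b=1
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)  # 2.0 1.0
```

This is the kind of statistical reasoning ML courses assume; libraries automate it, but understanding it is what the prerequisite is really about.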
Similar to our recommendations for the Big Data learning path, below are links and helpful information on resources and certifications for Machine Learning/Deep Learning. The recommendations below are based on the minimum skills and qualifications required. The numbers in the circles denote the sequence in which to follow the learning path.
5 Required Machine Learning/Deep Learning Skills in the Federal Sector
- Python Programming: The link contains numerous resources on building Python programming skills for beginner, intermediate, and advanced developers. To download Python, click here.
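For a sense of what early Python practice looks like, here is a small self-contained example — a word-frequency counter, a classic first exercise that also mirrors the word-count programs used throughout the Hadoop/Spark world (the input text is invented):

```python
from collections import Counter

# Count word frequencies in a snippet of text using the standard library
def word_counts(text):
    words = text.lower().split()
    return Counter(words)

counts = word_counts("to be or not to be")
print(counts.most_common(2))  # [('to', 2), ('be', 2)]
```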
- R Programming: The link contains documentation and articles on R programming, including information on downloading and getting started with R.
- Machine Learning (ML): This online Machine Learning course, offered on Coursera and created by Stanford University, is the most detailed and comprehensive course of its kind. To top it off, it's taught by Andrew Ng, VP & Chief Scientist of Baidu, Co-Founder of Coursera, and an Adjunct Professor at Stanford University. This is a personal favorite and a must for our employees.
- Deep Learning (DL): Similar to the Machine Learning course, this course offered by Coursera, taught by Andrew Ng and created by deeplearning.ai is intense, comprehensive and guaranteed to make you a master in Deep Learning.
- Platform Specific: This link contains information on MLlib, Apache Spark's machine learning library.
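Tying the two learning paths together, here is a hedged sketch of MLlib's DataFrame-based Pipeline API in Scala — assemble feature columns into a vector, then fit a regression model. The input file, column names, and schema are all hypothetical, and a running Spark installation is assumed:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

// Sketch of an MLlib pipeline: feature assembly + linear regression
object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

    // Illustrative input: numeric feature columns plus a label column
    val training = spark.read
      .option("header", "true").option("inferSchema", "true")
      .csv("hdfs:///user/jdoe/data/housing.csv")

    // Combine raw columns into the single vector column MLlib expects
    val assembler = new VectorAssembler()
      .setInputCols(Array("sqft", "bedrooms"))
      .setOutputCol("features")

    val lr = new LinearRegression().setLabelCol("price")

    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
    model.transform(training).select("price", "prediction").show(5)

    spark.stop()
  }
}
```

Notice how naturally this sits on top of the Spark and Scala skills from the Big Data path above — which is exactly why we recommend mastering those first.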