Course Information:

  • IDS 561 - Analytics for Big Data
  • Spring 2017
  • Monday 3:00pm - 5:45pm
  • DH 220
  • yuhenghu at uic dot edu


The “big data” paradigm has drawn significant attention in recent years as the costs of acquiring and storing data have plummeted. As a result, the bottleneck has shifted to fast, in-depth analysis. This shift has created its own set of problems, the most obvious being that large datasets are often computationally expensive to process. Algorithms that efficiently process data that fits in memory may become prohibitively expensive on larger datasets. Consequently, it can be difficult to gain insights from the underlying data.

This is an introductory course in big data analytics and data science. It has three main goals. First, it is intended to give students an appreciation of the issues involved in applying data science techniques (classification, clustering, and dimensionality reduction) to datasets that do not fit in main memory. Second, it is intended to provide working knowledge of, and experience with, some of the current distributed frameworks (e.g. Hadoop). Third, it is intended to provide students with hands-on opportunities to implement solutions using real-world datasets.

Academic Integrity

You are expected to adhere to the highest standards of academic honesty. Unless otherwise specified, collaboration on assignments is not allowed; collaboration among team members is allowed only for the selected final term projects. Use of published materials is allowed, but the sources must be explicitly stated in your solutions. Violations will be reviewed and sanctioned according to the University Policy on Academic Integrity. "Academic integrity is the pursuit of scholarly activity free from fraud and deception and is an educational objective of this institution. Academic dishonesty includes, but is not limited to, cheating, plagiarizing, fabricating of information or citations, facilitating acts of academic dishonesty by others, having unauthorized possession of examinations, submitting work for another person or work previously used without informing the instructor, or tampering with the academic work of other students." For more information about violations of academic integrity and their consequences, consult


A Java programming background (IDS 400 or 401) is required, along with experience in algorithms and data structures, probability, statistics, matrices, and a database query language such as SQL.

Recommended textbooks


  • Class activity: 15%
  • Projects: 85% (see requirement in Blackboard)

Weekly Schedule

Week 1: Intro to everything. Reading Material: A Very Short History Of Big Data, Eight (No, Nine!) Problems With Big Data, The Parable of Google Flu: Traps in Big Data Analysis, Chapter 1. Data Mining, Machine Learning and Cognitive Systems: The Next Evolution of Enterprise Intelligence (Part I)
Week 2: MLK day; no class
Week 3: MapReduce and Hadoop. Reading Material: MapReduce: Simplified Data Processing on Large Clusters, The Google File System, Hadoop Cluster Setup, Running Hadoop on Ubuntu Linux (Single-Node Cluster)
Week 4: MapReduce and Hadoop II & Lab Session. Reading Material: MapReduce: Simplified Data Processing on Large Clusters, The Google File System, Hadoop Cluster Setup, Running Hadoop on Ubuntu Linux (Single-Node Cluster)
Week 5: Query Processing on Hadoop. Reading Material: Comparing Pig Latin and SQL for Constructing Data Processing Pipelines, When to use Pig Latin versus Hive SQL?, Hive Tutorial, Apache Pig Tutorial – Part 1
Week 6: Managing big data. Reading Material: NoSQL movement, SQL VS. NOSQL- WHAT YOU NEED TO KNOW, What is NoSQL?, HBase Tutorial
Week 7: Managing big data: 2. Reading Material: Better Explaining What is CAP theorem, SQL VS. NOSQL- WHAT YOU NEED TO KNOW, What is NoSQL?, HBase Tutorial
Week 8: Cloud Computing. Reading Material: IBM - What is cloud computing?, Above the Clouds: A Berkeley View of Cloud Computing, Amazon Web Service
Week 9: Processing and Analyzing Big Text Data. Reading Material: Data-Intensive Text Processing with MapReduce, Large language model in machine translation
Week 10: Graph analysis using Big Data. Reading Material: Breadth-first search, Data Structure - Breadth First Traversal, PageRank
Week 11: Spring Break
Week 12: Clustering and Recommender System using Big Data. Reading Material: Item-Based Collaborative Filtering Recommendation Algorithms, Collaborative filtering, Chapter 9: Recommender system, K-means, Cluster Analysis, Chapter 7: Clustering
Week 13: Sentiment analysis using Big Data. Reading Material: Sentiment analysis, A toolbox for sentiment analysis, 10 Sentiment Analysis Tools to Track Social Marketing Success
Week 14: Streaming processing with Spark. Reading Material: Real Time analytics for Big Data: Facebook's New Realtime Analytics System, Building Realtime Big Data Services at Facebook with Hadoop and HBase - Jonathan Gray, Facebook, Apache Spark @Scale: A 60 TB+ production use case
Week 15: Streaming processing with Kafka. Reading Material: Companies Powered By Kafka, Apache Kafka for Beginners, Handling five billion sessions a day – in real time
Week 16: Final Presentation