Syllabus
We will cover the following subjects:
- Review of basic internet technologies: HTML, PHP, Java HttpURLConnection
- Introduction to Information Retrieval (text).
- Inverted indices and boolean queries.
- Query optimization.
- Unstructured vs semi-structured text.
- Text encoding: tokenization, stemming, lemmatization, stop words, phrases.
- The vector space retrieval model.
- tf.idf weighting. Scoring documents. The cosine measure.
- Introduction to data clustering.
- Partitioning methods: k-means clustering.
- Hierarchical clustering.
- Introduction to text classification. Naive Bayes models. Email spam filtering.
- The structure of the Web graph.
- Zipf's and Pareto's Laws.
- Web search overview: web structure, the user, paid placement, search engine optimization and spam
- Web Crawling and web indexes
- Link analysis: PageRank and HITS ranking methods
- Recognizing web spam with statistical and graph-theoretic methods
- The Social Web: Social networks, Blogs, Trust
- Discovery of Web communities