First Term of my Masters

My bad, I wasn’t expecting such a long post, but believe me, there is a lot to talk about! So here we go!

My first term out of three is done. Well, not entirely. Just the classes. One assignment was out of the way today, with 3 more to submit by mid-January and 1 by the end of January. And of course, 3 exams for this term’s modules in May.

It’s interesting for me to switch now to a different set of 4 modules for my second term. This is, in fact, one of the main reasons the Masters has been so challenging so far. The content for a class, which could easily span an entire academic year, is condensed into 10 weeks while keeping a high level of complexity.

Looking back on the first 2 and a half months of my Masters, I realise both my skill set and I have evolved tremendously. The pressure and the increased pace do push you to make the impossible possible. It’s a tough road towards the end result, but the feeling you get when you grasp the meaning of the concepts is priceless.

Although the MSc is titled Web Science and Big Data Analytics, it has a strong focus on machine learning. We are able to customise it any way we like, and by this I mean we can select whichever modules we like from all the Computer Science MSc modules available.

My 2 core modules this term were Statistical NLP and Complex Networks. As electives, I chose Computer Vision and Supervised Learning. It was an ambitious choice, as I knew these were 2 tough modules, where expectations are high and the content is advanced. But I couldn’t say no to the challenge. The syllabus for both was incredibly interesting and indeed they turned out to be an excellent choice.

First off, in Computer Vision, we covered the contents of Simon Prince’s amazing book Computer Vision: Models, Learning, and Inference. I would highly recommend it to anyone interested in the subject. We learnt about 2D and 3D image geometry, object identification, tracking, face recognition and texture synthesis, all under probabilistic models. We also covered graphical models, random forests, homographies and particle filters. As William Freeman (MIT) put it, “computer vision and machine learning have gotten married and this book is their child”.

I was really, really surprised when I received the grades for our first coursework submission and found out I got an A+, equivalent to above 90%. As I mentioned, there were times when I thought I would never be able to develop what was required of me, but pushing myself not to give up made the impossible possible. And it feels amazing!

In Supervised Learning we had 3 different lecturers throughout the term, including John Shawe-Taylor, renowned for his book Kernel Methods for Pattern Analysis. This module is equally demanding and has a strong focus on kernel methods. In our 2nd teaching week we covered kernels and regularisation, then moved on to online learning in our 3rd week. The module then introduced us to SVMs, Gaussian Processes, decision trees and ensemble learning, multitask learning, learning theory and sparsity methods. All with lots and lots and lots of maths.

Statistical Natural Language Processing was really, really fun. Its assessment is divided into 3 main assignments, and it is the only class with no exam. The last one will be a group coursework and we’ll be required to use deep learning, with recurrent neural nets (yuppy yey!). We covered language models, starting with the traditional n-grams, and we looked at machine translation, tagging and information extraction, such as entity recognition. I discovered there are some really interesting things to do in natural language processing, now that deep learning algorithms enable more accurate recognition of complex patterns. Languages have different levels of complexity, and training a model to learn a certain language really well is achievable. But there are many aspects that make this task difficult: jargon, word ordering, neologisms and tricky meanings all need to be accounted for. A really cool project that UCL’s NLP research group received funding for is the automated fact-checking project, sponsored by Google. It aims to help avoid the spread of fake news via social media or search engines.
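For anyone curious what a “traditional n-gram” language model looks like, here is a minimal sketch of a bigram model (n = 2). It is only an illustration and not the coursework code: the function names and the tiny three-sentence corpus are made up, and it uses plain maximum-likelihood counts with no smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows another (hypothetical helper)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def bigram_probability(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


# Toy corpus, invented purely for illustration.
toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(toy_corpus)

# "the" is followed by "cat" twice and "dog" once.
print(bigram_probability(model, "the", "cat"))
```

A model like this gives zero probability to any word pair it has never seen, which is one reason real systems add smoothing or move to neural models.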

Last, but not least, in Complex Networks, we studied graph theory and random network models, as well as various ways in which networks’ properties can be analysed to help make informed predictions and decisions. Concepts such as the small-world property (the 6 degrees of separation), power laws and scale-free networks were covered throughout the term. A very important point made in this class was that links in networks carry meaningful information from the real world. For example, a trade is made between 2 countries (nodes in a network) due to an agreement, involving political, social and macroeconomic reasons and so on. New links form for a reason, and old links are broken for a reason. It’s amazing how much predictive power network studies can have. This links really well with our Masters because the web gave birth to a whole range of discoveries in the field of network science, due to its unprecedented complexity and scale. As Barabási notes in his book, “the Web is the largest network humanity has ever built. It exceeds in size even the human brain (N ≈ 10¹¹ neurons)”. It is a fascinating network to study and in the past 2 decades it has produced some of the best papers in network science.
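To make the “degrees of separation” idea a bit more concrete, here is a small sketch that counts the fewest links between two nodes using breadth-first search. The network, the node names and the function name are all hypothetical; real network analysis would normally use a dedicated library and far larger graphs.

```python
from collections import deque


def degrees_of_separation(graph, start, target):
    """Return the fewest links between start and target, or None if unreachable."""
    if start == target:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return distance + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


# Toy undirected network stored as adjacency lists (invented for illustration).
toy_network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
    "F": [],  # an isolated node with no links
}

print(degrees_of_separation(toy_network, "A", "E"))  # path A-B-D-E, so 3 links
```

In a small-world network this number stays surprisingly low even when the network has millions of nodes.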

I’ll finish off here, as this simply exceeds what I had in mind when I started writing. But hopefully this will prove helpful for anyone interested in pursuing an MSc in the future at UCL. I am happy to share more details of my experience and answer any other questions you might have about particular modules.

In term two, I am very eager to start studying Information Retrieval & Data Mining, Web Economics, Interaction Design and Advanced Topics in Machine Learning, taught by Google DeepMind’s researchers.


How do I see Big Data?

A New Year has arrived and I thought I should open it with my favourite tech topic of 2015. Even though Big Data has been around for at least 3 years, this will be the year when people realise its power. This article is about how I would explain what it means to somebody with no technical background. Big Data is a term that has unfortunately been overused lately, but it is, rightfully, the biggest trend in technology we’ll experience this decade.

Data is around us every day and has helped businesses work more proficiently and efficiently for decades. We gather data in order to predict, for example, the times at which employees in a factory are most productive. Correct analysis of the data leads to increased productivity by improving strategy, which ultimately leads to financial gains.

However, the most difficult part here is analysing this data. For example, Amazon knows that golf equipment is related to sunscreen. This way, when a user is searching for golf equipment, they’ll receive a recommendation to order sunscreen, too. The computer wouldn’t know these were related if big sets of older data hadn’t shown that a significant number of users order these products together or within a very short period of time.
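A rough sketch of the idea: count how often two products show up in the same order, then recommend whatever is most often bought alongside the item being viewed. This is not how Amazon actually does it; the orders, product names and function names below are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def count_co_purchases(orders):
    """Count how many orders contain each pair of products."""
    pair_counts = Counter()
    for order in orders:
        # sorted(set(...)) gives each pair one canonical, duplicate-free form.
        for pair in combinations(sorted(set(order)), 2):
            pair_counts[pair] += 1
    return pair_counts


def recommend(pair_counts, item, top_n=1):
    """Return the products most often bought together with item."""
    scores = Counter()
    for (first, second), count in pair_counts.items():
        if first == item:
            scores[second] += count
        elif second == item:
            scores[first] += count
    return [other for other, _ in scores.most_common(top_n)]


# Toy order history, made up for this example.
toy_orders = [
    ["golf clubs", "sunscreen"],
    ["golf clubs", "sunscreen", "golf balls"],
    ["golf clubs", "golf balls"],
    ["sunscreen", "towel"],
    ["golf clubs", "sunscreen"],
]

pair_counts = count_co_purchases(toy_orders)
print(recommend(pair_counts, "golf clubs"))  # sunscreen appears in 3 of the 4 golf club orders
```

With only five orders the result means very little, which is exactly why the large sets of older data matter.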

As Jack Cutts states in his article “Ghost in the machine: The Predictive Power of Big Data Analytics”, “correlation does not equal causation”. For example, the fact that I drink water every day and the fact that I sleep every day are strongly correlated, but this does not mean that drinking water is the cause of my sleep. This is where the beautiful power of computers comes in: they are able to read, diagnose and predict from this data in order to work out which correlations actually point to causation.
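Here is a small sketch of why a strong correlation on its own proves nothing about cause. The numbers are entirely made up: both sales figures are generated from a hypothetical temperature series, so they move together perfectly even though neither one causes the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Invented data: temperature is the hidden common cause of both sales series.
temperature = [15, 18, 21, 24, 27, 30]
ice_cream_sales = [2 * t + 5 for t in temperature]
sunscreen_sales = [3 * t - 10 for t in temperature]

# Very close to 1: the two series rise and fall together.
print(pearson(ice_cream_sales, sunscreen_sales))
```

Ice cream does not sell sunscreen; the weather drives both. Spotting that hidden third factor is the hard part of the analysis.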

A big threat to Big Data at the moment is the issue of privacy. People are afraid of something collecting data about them, as they don’t trust that it stays anonymous. It is very important for people to understand the benefits and to be confident that all data about them is stored anonymously. A lot of companies have made mistakes and have been too intrusive in users’ personal lives, proving to know a bit more than they were supposed to.

The medical industry has been trying for years to store patients’ data electronically. This would help, for instance, to analyse millions of medical records and predict disease. The machine doesn’t have to know the identity of the patients in order to read the data and determine any connection between a blood measurement and the incidence of blood-related diseases, such as leukaemia or diabetes. Identities are completely irrelevant. Data such as age or location might be the only relevant factors, but still, the machine won’t know to whom they belong. It’s pure statistics enhanced by computing power.

To sum up, I would encourage everyone who is a bit sceptical about institutions collecting their data to be more open towards this, as the vast majority of it will only be used to predict behaviours and help in prevention. We willingly share much more personal data over the Internet on social media websites like Facebook, which is more likely to threaten one’s safety. TV viewing has been monitored for years in order to help set the best advertising rates and build the demographic profile of viewers. The power of Big Data is limitless and would only bump into barriers such as bandwidth, computing power or people’s imagination.