Week 20

Author: malm

Elements of Machine Learning V: spaCy and Natural Language Processing


Over the last four weeks I’ve been running through an exercise to demonstrate how machine learning techniques can be applied to a number of specific problems, ranging from fitting functions to a set of data points to selecting the best classifier for a given dataset. The last couple of posts have focussed specifically on word vectors. The linking theme across the various discussions and code samples has been the use of Python and specifically the scikit-learn machine learning library.

This week’s post builds on last week’s, where I focussed on basic sentiment analysis using scikit-learn but also introduced the spaCy natural language processing framework. I indicated that it could be used in conjunction with scikit-learn to build a more sophisticated approach. This code sample gives a flavour of what I did to integrate the two. In it I load the same IMDB csv file built last week into a pandas dataframe, process the data with spaCy to check that side is all present and correct, and then integrate it into a scikit-learn pipeline. The tokenizer here is built using spaCy and the text transformer using simple Python replace calls. To run this code I set up an AWS 2GB t2.small instance running 64-bit Ubuntu 14.04 with a Python 3.4 virtualenv and a complete scikit-learn stack. It helped convince me that the ability to easily switch cloud-based machine learning environments on and off for rapid prototyping and investigations will be an important factor in popularising these approaches:

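import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cross_validation import train_test_split  # sklearn.model_selection from 0.18 onwards
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from spacy.en import English  # spaCy 0.x API, as used in this post

parser = English()

# tokenizer and TextTransformer were not part of the listing as published;
# these are minimal reconstructions from the description above (assumptions).
def tokenizer(text):
    # spaCy-based tokenizer: lower-cased token text, punctuation and whitespace dropped
    return [t.orth_.lower() for t in parser(text) if not (t.is_punct or t.is_space)]

class TextTransformer(BaseEstimator, TransformerMixin):
    # Cleans raw review text using simple str.replace calls
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [text.replace('<br />', ' ').replace('\n', ' ') for text in X]

def bestClassifierModel(X_train, y_train):
    vectorizer = CountVectorizer(tokenizer=tokenizer, ngram_range=(1,1))
    clf = Pipeline([('xform', TextTransformer()),
                    ('vect', vectorizer),
                    ('clf', LogisticRegression(random_state=0))])
    params = {'vect__stop_words':None, 'vect__tokenizer':tokenizer,
              'clf__C':10, 'clf__penalty':'l2'}
    clf.set_params(**params)  # apply the chosen parameters before fitting
    clf.fit(X_train, y_train)
    return clf, vectorizer

# ---------------------------------------------
df = pd.read_csv('movie_data.csv')
# ----------- Try out spaCy parsing -----------
N = 10
print("Checking first %d reviews:\n%s" % (N, df.head(N)))
corpus = '.'.join(df.review[:N])
p = parser(corpus)
for i, sent in enumerate(p.sents):
    print("---- sentence %d ----" % i)
    for j, token in enumerate(sent):
        # loop body missing from the published listing; printing each token is an assumption
        print("%d: %s" % (j, token.orth_))
# ----------- Try out classifier --------------
ratio = 0.75
# right-hand side missing from the published listing; assumes a 'sentiment' label column
X_train, X_test, y_train, y_test = train_test_split(
    df.review, df.sentiment, train_size=ratio, random_state=0)
clf, vectorizer = bestClassifierModel(X_train, y_train)
testscore = clf.score(X_test, y_test)
print("----\nTest accuracy: %.3f\n----" % testscore)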

It took approximately 12 minutes to run this code and the test accuracy I achieved using this approach was surprisingly low (well under 90%). It’s possible I’m not doing something right. If I get the chance I’ll try to look into why a ‘simple’ unigram classifier apparently performs better at classifying IMDB review sentiment than the spaCy-based one.
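One way to start that investigation would be to hold the pipeline fixed and swap only the tokenizer, so that any difference in test accuracy can be attributed to tokenization alone. The following is a minimal sketch of that idea rather than the code I ran: the toy reviews stand in for movie_data.csv, and whitespace_tokenizer, build_pipeline and compare_tokenizers are hypothetical names, with the whitespace tokenizer standing in for the spaCy one so the sketch runs without spaCy installed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in data (an assumption; the post itself uses the IMDB movie_data.csv)
train_reviews = ["a wonderful, moving film", "truly great acting",
                 "loved every minute", "a dull, boring mess",
                 "terrible script and awful acting", "hated every minute"]
train_labels = [1, 1, 1, 0, 0, 0]
test_reviews = ["great film", "awful mess"]
test_labels = [1, 0]

def whitespace_tokenizer(text):
    # Hypothetical stand-in for the spaCy tokenizer
    return text.lower().split()

def build_pipeline(tokenizer=None):
    # tokenizer=None selects CountVectorizer's built-in unigram tokenization
    return Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenizer, ngram_range=(1, 1))),
        ('clf', LogisticRegression(C=10, random_state=0)),
    ])

def compare_tokenizers(tokenizers):
    # Fit one pipeline per tokenizer on the same data and collect test accuracy
    scores = {}
    for name, tok in tokenizers.items():
        model = build_pipeline(tok).fit(train_reviews, train_labels)
        scores[name] = model.score(test_reviews, test_labels)
    return scores

scores = compare_tokenizers({'default': None,
                             'whitespace': whitespace_tokenizer})
for name in sorted(scores):
    print("%-10s test accuracy: %.3f" % (name, scores[name]))
```

On the real data the spaCy tokenizer would be passed in alongside the default one, using the same train/test split for both runs.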

Artificial Intelligence and Robots

  • ZDNet outline the three reasons why AI is taking off right now: i) big data, ii) inexpensive commodity GPU hardware and iii) public cloud and also present a 10 point plan to allow your business to take practical advantage of the coming ‘Enterprise AI’ wave:
    1. Embrace the idea that machine intelligence will matter to your organization.
    2. Identify which forms could be most important to your firm.
    3. Check out relevant start-ups and developments.
    4. Understand which parts of your firm could be safely run by algorithms.
    5. Determine which internal and external data sets have the most potential.
    6. Assess the extent to which your firm’s key professional expertise can be automated.
    7. Try out deep learning, neural computing and other technologies.
    8. Map the relevant AI services and technologies to your firm’s value chain.
    9. Develop machine intelligence experts in your organisation.
    10. Factor AI advances into your strategic planning.
  • Google Director of Research Peter Norvig’s excellent recent EmTech keynote on the state of the art in AI conveys a similar message of serious change looming in the near distance for any remaining businesses that assume it doesn’t affect them:

“Every industry will feel the impact of both established and emerging artificial intelligence techniques.”

Boston Dynamics Atlas

  • Post IO, it is clear Google’s focus has shifted to TensorFlow and “AI as a platform”. This includes the provision of specialised hardware support in the form of the TPU, which has been well received by industry analysts and puts them in direct competition with Nvidia.
  • Some companies are already taking dramatic steps to accommodate AI advances, with seismic impact on their human workforce. The news that Foxconn has ‘replaced 60,000 factory workers with robots’ is a good example and was widely covered last week.

the Web-connected Apple speaker in development would allow people to turn on/off or change settings for home appliances and other devices that support Apple’s “HomeKit” software, says the person with direct knowledge of the project. Such items include lights, sensors, thermostats, plugs and locks.

  • This rock sorting robot is admittedly less centre stage in application than a Siri speaker, though far more interesting to watch:

There are over 80 different countries that have military robots. They aren’t doing this because they think it’s cool. They are doing it because they think it gives them some kind of advantage in either a current conflict or a future conflict.



It has drawn up lists that rank top phone makers by how up-to-date their handsets are, based on security patches and operating system versions, according to people familiar with the matter. Google shared this list with Android partners earlier this year. It has discussed making it public to highlight proactive manufacturers and shame tardy vendors through omission from the list

  • A key practical obstacle they will face is that most Android OEMs are already in a very tough business and it’s far from clear that shaming them on releases will make any difference to their behaviour.  Google may find that offering to provide specialist update engineering support to cash-strapped manufacturers would go down better.
  • A factor that will compound OEM belligerence is emerging evidence that Android Wear watches are not proving to be a compelling product proposition, in spite of Google’s major v2.0 update of the platform announced at IO. At least that seems to be the conclusion Samsung have reached in apparently opting to switch to Tizen instead for various reasons including, interestingly, battery performance.

“Today’s verdict that Android makes fair use of Java APIs represents a win for the Android ecosystem, for the Java programming community, and for software developers who rely on open and free programming languages to build innovative consumer products”

  • Google also announced a free cloud-based tool called Data Studio 360 intended to help users with friendly visualisation support. It could prove a significant competitor for the likes of Tableau and Qlik in due course, both of which are distinctly non-free:


Of the 15 countries where WeChat’s Messi ad was aimed at – Argentina, Brazil, Hong Kong, India, Indonesia, Italy, Malaysia, Mexico, Nigeria, the Philippines, Singapore, South Africa, Spain, Thailand, and Turkey, points out Ad Age – the results look grim for the company. In not one single country does WeChat appear to be ahead of WhatsApp or Facebook.

“The Internet is as much a tool for control, surveillance and commercial considerations as it is for empowerment.”

  • The Indian concept of jugaad or ‘frugal innovation’ has been highlighted before. It has reached its ultimate incarnation in the reusable mini space shuttle, which may give SpaceX et al a run for their money in terms of costs. Perhaps future commercial space projects will get outsourced to India, not just the software systems behind them:

The Internet of Things

Engineering students and professors at the University of Washington have developed a way for passive IoT devices to receive energy via an RF carrier wave transmitted by an active WiFi power source. One central power source plugged into a wall outlet can transmit energy wirelessly to the passive devices, allowing sensors to send data without the need for power-hungry RF circuitry onboard.


Apps and Services


Anti Mobile

  • Evidence that excessive phone use is seriously harmful to your ability to concentrate and may even be responsible for inducing ADHD-like symptoms comes in this Medium post from Tristan Harris who apparently did a stint as “Product Philosopher” at Google.  That sounds like an intriguing job.  Wonder how much practical remit came with that territory.  The article is full of useful tips on how to turn your phone into a genuine minimum viable product:
    • Minimize Compulsive Checking & Phantom Buzzes
    • Minimize Fear of Missing Something Important
    • Minimize Unconscious Use
    • Minimize “Leaky” Interactions (“leaking out” into something unintended)
    • Minimize Unnecessary Psychological Concerns generated by the screen.

  • Relatedly, a blogger in the Guardian suggests that focussing on taking and uploading images of an event, rather than experiencing the event per se, essentially constitutes contributing to one’s obituary rather than living. Many of us seem conditioned to believe that this simulacrum is somehow ‘even better than the real thing’, as Umberto Eco referred to it in Travels in Hyperreality. Whenever this topic comes up (as it regularly does) I am transported back to Hyde Park in 1997 and the sight of thousands of people glued to the giant screens showing Princess Diana’s funeral cortege travelling up Park Lane, seemingly oblivious to the fact that the real thing was trundling past a short distance to their right. It felt shocking then and would have been inconceivable for previous generations, but it seems entirely accepted now to behave in this way.
  • An all-glass iPhone is in the offing.  Besides the obvious irony of one of the most notoriously secretive tech companies producing the ultimate transparent device, it’s not clear (see what I did there) what the user benefit of a glass phone is.

Nokia and Microsoft have slashed thousands of Finnish jobs over the past decade, and the lack of substitute jobs is the main reason for the country’s current economic stagnation.

Software and Maths


Agreeing to features is deceptively easy. Coding them rarely is. Maintaining them can be a nightmare. When you’re striving for quality, there are no small changes.

  • And yet the one thing you can be sure of with software is that everything absolutely will change around you:

  • The mutability of software stands in stark contrast with the immutability of mathematics. This excellent post explains why it is so important to keep your mathematical skills fresh irrespective of the type of coding you do in the day job. All the more so in the current development landscape:

software development is quickly shapeshifting. If you discount mathematics, and in turn focus on learning transitory programming tools, you’ll be left without the skills necessary to adapt to emerging computer science concepts that have already started infiltrating engineering teams today. … In the next 10 years, software engineers aren’t still going to be limited to programming web and mobile apps. They’ll be working on writing mainstream computer vision and virtual reality apps, working with interesting cryptographic algorithms for security and building amazing self-learning products using machine learning. You can’t go very far in any of these fields without a solid mathematical foundation.


  • Going back to basics, here’s an interesting presentation of a more intuitive notation for power/root/log relationships built around the “triangle of power” with the three numbers involved in the relationship at the vertices and the edges between them defining their specific relationships:
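The relationship the triangle encodes can be made concrete with a small worked example. This is only an illustrative sketch and not taken from the presentation: the function names are hypothetical, and the three numbers 2, 3 and 8 are chosen so that each one can be recovered from the other two.

```python
import math

# One relationship, three views: base ** exponent == result
base, exponent, result = 2.0, 3.0, 8.0

def solve_result(base, exponent):
    # Power: know base and exponent, find the result
    return base ** exponent

def solve_base(exponent, result):
    # Root: know exponent and result, find the base
    return result ** (1.0 / exponent)

def solve_exponent(base, result):
    # Logarithm: know base and result, find the exponent
    return math.log(result, base)

# Each vertex of the triangle is recoverable from the other two
# (compared with a tolerance because of floating-point rounding)
print(math.isclose(solve_result(base, exponent), result))
print(math.isclose(solve_base(exponent, result), base))
print(math.isclose(solve_exponent(base, result), exponent))
```

Power, root and logarithm are the same fact read from three different corners, which is the point the notation is trying to make visible.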

Work and wellbeing

The biggest takeaway for me, is learning …  that in order for me to truly feel the pulse of life, to experience that fullness of life, I just have to keep treading into the unknown, to keep letting go of what I know, in order to come close to knowing life for what it is.

a recently published study of 16,426 working adults in Norway found that those with workaholism are significantly more likely to have psychiatric symptoms.


What is most frightening about this lawsuit is that the press has always played a significant role in defending the small and powerless against the big and powerful. Gawker has played this role in its own tabloid style, but Thiel’s funding of this lawsuit shows how money can protect that power through third-party litigation funding.

Thiel’s secret laundering of the Gawker lawsuit disqualifies him as someone who should be on a board of directors of any organization that claims to value freedom of expression. Facebook’s other directors, employees, and users should ask how much they want to be associated with a company that keeps someone like Thiel in a position of such power and influence.

  • Swirling underneath it all is the question of whether Trump’s politics constitutes fascism or not. The NYT summarized the current landscape of opinion, which includes this attempt to describe what’s going on without invoking Hitler:

“It seems to me in developed and semi-developed countries there is emerging a new kind of politics for which maybe the best taxonomic category would be right-wing populist nationalism”

  • The New Yorker’s Adam Gopnik has been painting a progressively dire vision of the US under President Trump over the last few months. His latest article is his most urgent yet, urging US voters not to accept “a declared enemy of the liberal constitutional order of the United States”. One suspects that he’s preaching to the converted, with those that need converting simply not able or willing to listen:

If Trump came to power, there is a decent chance that the American experiment would be over. This is not a hyperbolic prediction; it is not a hysterical prediction; it is simply a candid reading of what history tells us happens in countries with leaders like Trump. Countries don’t really recover from being taken over by unstable authoritarian nationalists of any political bent, left or right—not by Peróns or Castros or Putins or Francos or Lenins or fill in the blanks. The nation may survive, but the wound to hope and order will never fully heal.

  • Mike Godwin, author of the infamous eponymous Law of Hitler Comparisons, feels compelled to clarify his most famous meme following its dramatic invocation by two previous London Mayors as well as a range of assorted anti-Trump commentators:

the purpose of Godwin’s Law was never to be predictive — instead, I designed the law to create a disincentive for frivolous or reflexive Hitler or Nazi comparisons so that, when we do feel compelled to make them in our arguments, we are more likely to be mindful about them.
