Sunday, December 15, 2013

Scientific Computing: Modeling Neuroplasticity

The pursuit of strong artificial intelligence involves numerous areas of research. While much current AI research focuses on achieving specific intelligent tasks the brain is capable of, disciplines like computational neuroscience seek to understand how the brain itself works. In a recent study from the field, researchers at MIT have modeled how the brain is able to learn new things while still retaining what it has already learned, a property known as neuroplasticity.

The research suggests that neurons constantly try out new configurations of connections to other neurons, searching for the arrangement that best supports everything the brain needs to learn. This allows some neurons to specialize in certain tasks while others remain free to learn new ones.

One key element of the study, not widely explored before, was determining how noise acts within the model. The researchers found that noise could actually benefit the model: when the model is hyperplastic, noise drives it to explore more new connection configurations. They concluded that rather than hindering learning, the noise helped the model learn a variety of new things while retaining the ability to do old ones. The model also helps explain how skills can diminish when not practiced often enough, since new connections will eventually start to overwrite old skills after too much time has elapsed.
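As a loose computational analogy (my own illustration, not the researchers' model), the same effect shows up in simple search problems: a purely greedy search gets stuck in the first local optimum it finds, while adding randomness, here in the crude form of random restarts, lets the search keep exploring other configurations.

```python
import random

# Toy analogy (not the MIT model): greedy descent over a "fitness landscape"
# stops at the first local optimum, while random restarts keep exploring.
landscape = [5, 3, 4, 6, 2, 1, 2]  # lower is better; the global best is 1

def greedy(start):
    """Descend to a lower-valued neighbor until none exists."""
    i = start
    while True:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]
        best = min(neighbors, key=lambda j: landscape[j])
        if landscape[best] >= landscape[i]:
            return landscape[i]
        i = best

def noisy(start, restarts=20, seed=0):
    """Greedy descent plus random restarts; keep the best value seen."""
    rng = random.Random(seed)
    best = greedy(start)
    for _ in range(restarts):
        best = min(best, greedy(rng.randrange(len(landscape))))
    return best

print(greedy(0))  # stuck at the local optimum 3
print(noisy(0))   # random exploration can also reach the global optimum 1
```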

Only time will tell whether this research leads to further findings on the subject or remains an interesting fact. Regardless, any breakthrough like this in computational neuroscience advances both our understanding of how the human brain works and the long-term goal of creating an intelligence on par with it.

Computer Graphics: Breakthrough in Image Pattern Detection

In a research experiment involving unsupervised machine learning, Google may have discovered the most significant image pattern on the internet, or at least on YouTube, where the experiment was performed. The system was designed to detect and rank imagery patterns, which researchers could then analyze for what they represent. It may come as no surprise, then, that this system succeeded in detecting what is important to the users of YouTube and most of the internet in general: cats.

What was originally designed to detect significant patterns in imagery data became the world's first cat detector: after being trained on YouTube videos over the course of three days, the system found imagery of cats to be among the most frequently detected, and thus most significant, patterns. According to an article on Slate, linear tool-like objects held at about a 30-degree angle were also common features detected, and after adding a round of supervised learning, the classifier could detect human faces with around 82% accuracy.

Perhaps not the most useful breakthrough in unsupervised learning on the surface, the model does prove the capability of emerging methods to detect meaningful features in images and video. Perhaps with some tweaking the system will also be able to detect dogs, but no breakthroughs in this area have emerged yet.

Communications and Security: Knowing How Your Code Works

It may seem obvious when said, but knowing exactly how your code works and what it is doing, especially in edge cases, is a very important aspect of security. When an attacker discovers a bug or "feature" in your code that you are unaware of, it can often be exploited to varying degrees of maliciousness. The well-known music service Spotify learned this lesson first-hand when an exploit involving the way it processed usernames was found.

The key mistake was allowing users to have usernames containing any valid Unicode character while the username actually stored was a more restricted version. The username stored and checked against for internal purposes was processed with the str.lower() method in Python, which maps a surprising number of Unicode characters down to the 26 lowercase English ASCII letters. As a result, many different submitted usernames could become the exact same username once processed by lower(). The attack itself used this fact to hijack user accounts.
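A small Python sketch (illustrative only, not Spotify's actual code) shows the class of collision involved. The Kelvin sign character U+212A, for example, is lowercased to the ASCII letter "k":

```python
# Two visually distinct usernames can collapse to the same canonical
# form under str.lower().

def canonical(username):
    """Naive canonicalization: lowercase the username for storage."""
    return username.lower()

# U+212A is the KELVIN SIGN, which str.lower() maps to the ASCII letter 'k'.
victim = "kelvin"
attacker = "\u212aelvin"  # looks like "Kelvin" but starts with the Kelvin sign

print(victim == attacker)                        # False: the raw strings differ
print(canonical(victim) == canonical(attacker))  # True: canonical forms collide
```

One defense is to reject a signup whose canonical form already exists, or to restrict usernames to a safe character set before canonicalizing, so that distinct raw usernames can never share a stored form.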

Existing Spotify accounts could be hijacked by signing up for a new account with a username that maps to an existing username when processed by lower(), then submitting a password reset request. The account creation process would associate the new email address with the old account as if it were an email change, and the password reset would thus be sent to that address instead of the one belonging to the original user. Once the password was reset, one could log in as the original user with the new password.

While this is a more fringe case of code misbehaving in ways its authors never knew (which arguably also makes it a more interesting one), it does show the importance of verifying code behavior as well as user inputs for security purposes. Luckily for Spotify, the user who discovered this exploit reported it quickly, but the results could have been much worse under different circumstances.

Artificial Intelligence: Why You Should Have Paid Attention in Your Statistics Classes

While not a particularly new field within Computer Science, Artificial Intelligence has gone through many changes over the years, with many advances and setbacks. As it has become increasingly widespread and marked by new successes over the past decade or two, one thing has become clear: statistics and probabilistic approaches seem to be the key to continued success in just about every facet of the field.

One area within AI that has seen tremendous success from statistical and probabilistic approaches is Computational Linguistics. The area has a longtime rivalry between approaches based on hand-crafted rules from linguists and statistical/probabilistic approaches. As Peter Norvig notes, the majority of systems successfully solving problems within Computational Linguistics use statistical and/or probabilistic approaches at least partially, if not entirely. These include such popular and widespread application areas as search engines, machine translation, and speech recognition. As new research in the area tends toward newer and better statistical and probabilistic approaches, this trend does not seem to be changing anytime soon.

Yet another area with traditionally more formal roots that has benefited greatly from probability and statistics is Graph Theory. Probabilistic graph-based models, such as the Bayesian networks and Markov models that build on the work of Thomas Bayes and Andrey Markov, are very widespread throughout Artificial Intelligence these days, and the applications of such models may be limitless. They are used widely in Computational Linguistics, pattern recognition, and Bioinformatics, just to name a few areas, and are capable of encoding the fluctuations, randomness, and uncertainty that are a part of most things we try to model.
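As a small illustration (a toy example of my own, not drawn from any of the applications above), here is a two-state Markov chain in plain Python: the states form a graph, the edges carry transition probabilities, and uncertainty about the future is computed by pushing today's distribution along those edges.

```python
# Minimal Markov chain: states are nodes, edges carry transition
# probabilities, and tomorrow's distribution follows from today's.
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(dist):
    """One transition: spread each state's probability along its edges."""
    nxt = {state: 0.0 for state in transitions}
    for state, p in dist.items():
        for succ, q in transitions[state].items():
            nxt[succ] += p * q
    return nxt

dist = {"sunny": 1.0, "rainy": 0.0}  # certain it is sunny today
for _ in range(3):
    dist = step(dist)
print(dist)  # the distribution of uncertainty three days out
```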

Machine Learning is another popular area within Artificial Intelligence that makes heavy use of statistics. Many machine learning techniques tackle problems that originated in statistics, such as linear and logistic regression, with new approaches, and a number of other parallels show just how interrelated Machine Learning and Statistics are.
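For example, simple linear regression, a staple of both fields, can be fit directly with the classical least-squares formulas rather than any iterative learning procedure; a minimal sketch in plain Python:

```python
# Simple linear regression via the closed-form least-squares formulas --
# the same model machine learning often fits by gradient descent.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1
print(fit_line(xs, ys))    # recovers slope 2.0 and intercept 1.0
```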

While math in general always seems to be a driving force behind the discovery of new techniques and algorithms in software, statistics and probability in particular are becoming increasingly important in computer science. It doesn't seem likely that computer science as a discipline will ever break away from its dependence on math, so to the CS majors out there: be sure to pay attention in math class!

Sunday, November 17, 2013

History of Computer Science: Artificial Intelligence

“Artificial Intelligence research is an attempt to discover and describe aspects of human intelligence that can be simulated by machines.”
Philip C. Jackson, Introduction to Artificial Intelligence

Artificial Intelligence is a popular industry within technology today, but its history has been a rocky one filled with many transformations and long lulls between rapid booms. The definition itself has even evolved over the years in response to various innovations and setbacks. While the goal of AI has remained fairly constant, our understanding of that goal and innovations towards it have set milestones in its history. Two main types of AI have emerged over the years based on both limitations and advances: strong and weak AI.

Originally, artificial intelligence was the purely theoretical concept of machines that can mimic human learning and thought. This is now considered strong AI, and while it ultimately remains the goal of the discipline, there is still a long way to go before it can be achieved. In the 1940s, early research on modeling neural networks structured like mammalian brains helped introduce the discipline and shape further research pursuits. It was the Turing Test in the 1950s, however, that truly marked the rise in popularity of research into strong AI. Alan Turing devised a test that has become the standard for what would be considered (strong) artificial intelligence. A person at a computer terminal, called the interrogator, chats with two different agents, one a human and the other the proposed AI system. If the interrogator cannot distinguish which agent is the human significantly better than chance, the proposed system is considered to exhibit artificial intelligence. This is more or less still considered an applicable test for strong AI, and other tests have even been derived from it as our understanding and definition of AI have evolved.

Over the next couple of decades some advancements were made in the field, particularly with problem solving applications and testing environments. Computer vision, robotics, and natural language processing in particular had advances in testing environments. By the 1970s, however, the rate of advancement in artificial intelligence simply wasn't keeping up with the demanding expectations held for it at the time. Despite the advances that were made, the practical applications were very rare and interest started to decline.

This lull lasted about a decade, but research continued, and by the 1980s sales of AI hardware and software began to rise as the much-needed practical applications finally started to appear. These applications led to the concept of weak AI: systems that were not intelligent in the strong AI sense, but that used techniques considered vital to such a system and exhibited intelligent qualities on their own while solving a variety of interesting problems. Speech recognition and learning systems were two of the first areas to see an increase in such practical applications, followed by automatic programming systems that could generate programs from desired behavior and many of the statistical and graph-based approaches that are widely popular today.

Through the 1990s and into the new millennium many new applications and areas emerged, including machine learning, planning systems, and search systems. The use of weak AI techniques and systems has continued to increase to this day and is now very widespread, with major companies such as Google, IBM, and Microsoft leading the way in new research and applications. With this continuing trend, the possibility of strong AI once again emerges as a goal within the field. The hopes and expectations have even risen, as in the concept of The Singularity, attributed to Vernor Vinge, in which the intelligence of computers not only meets but exceeds that of humans. It is difficult to say if or when a truly intelligent system will be created, especially as our understanding of what properties such a system should have evolves along with the techniques that might lead to it, but the field of artificial intelligence is certainly an interesting one that is not likely due for another decline in the foreseeable future.

“AI has the potential to change our world like no other technology.”
M. Tim Jones, AI Application Programming

File Sharing: Changing How Business is Done

What if I told you that you could get all the music, movies, books, and software you want for free? Well, the good news is that nowadays you can quite easily. The bad news is that it's illegal. File sharing is nothing new as far as the Internet is concerned, but over the years the attempts to mitigate it have paled in comparison to its proliferation. Although the term file sharing technically covers the exchange of any content, its usage today almost always refers to pirated content, or at least to the services that primarily serve such content.

In the early days of content piracy file sharing websites would host files for download. Music was by far the most common type of content pirated then, taking off with the introduction of Napster. Software piracy was also quite widespread through various warez sites, but was not as widespread as music. The problem with file sharing sites, however, was that the content was all centralized in one or a few locations so taking down the sites could easily mitigate the piracy taking place. This ultimately led to the demise of Napster, but not before a new and significantly more resilient framework had already taken hold. Why host the files and tax your bandwidth when you could just as easily pass this on to the users? With this realization, peer-to-peer file sharing was born.

Unlike hosted content, peer-to-peer file sharing serves content to those who want it from those who already have it. Users simply download a client that handles this for them and are free to download anything available from other users. Early on, clients like Kazaa and LimeWire were quite popular, but they ultimately faded as newer clients were developed. The critical turning point, though, was torrents. While the earlier clients relied on their own servers to coordinate sharing and search functionality, once again a point of weakness in the system, torrents were a more general framework built around files containing metadata about the content to download and the servers coordinating the transfer. This spread the server load across multiple networks and countries, making takedowns more difficult since each network would have to be taken down individually. The torrent files themselves were available from multiple sites, gaining the same benefit. Some sites required signing up and often imposed restrictions on doing so to prevent those looking to take down the networks from being able to do so as easily. Eventually multiple clients using the same framework became available as well. This level of dispersal made taking down the framework next to impossible for those seeking to protect the content being pirated.
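As a concrete glimpse of what a torrent file actually contains: the metadata is stored in a simple encoding called bencode, which a few lines of Python can decode. (The decoder and the sample data below are my own minimal sketch, not a complete implementation of the format.)

```python
# Torrent files store their metadata (tracker URLs, file names and sizes,
# piece hashes) in bencode: integers as i42e, byte strings as 4:spam,
# lists as l...e, and dictionaries as d...e.

def bdecode(data, i=0):
    """Decode one bencoded value from bytes, returning (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":                      # integer: i42e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                      # list: l ... e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                      # dictionary: d ... e
        i += 1
        d = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            val, i = bdecode(data, i)
            d[key] = val
        return d, i + 1
    # byte string: <length>:<bytes>
    colon = data.index(b":", i)
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length

# A tiny hand-made example (not a real torrent):
meta, _ = bdecode(b"d8:announce17:http://tr.example4:infod6:lengthi123eee")
print(meta[b"announce"], meta[b"info"][b"length"])
```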

The various industries affected by file sharing continue to evolve as a result. Some of the most popular sites for downloading torrents, such as The Pirate Bay, have been constantly targeted with little to no effect as far as users are concerned. Many industries have looked into or implemented increased security measures to try to combat piracy. Software vendors have implemented activation requirements that track the number of copies of a particular serial number in use at any given time, but this has done little to help, since those pirating software tend to be more computer savvy, and cracks that patch the software to remove this security functionality are quite common. Electronic book publishers attempted to add encryption to control copying and redistribution. This too has been mostly in vain, as the encryption schemes used to date have been broken and the cost of developing them far outweighs the cost of publishing the book in the first place.

The music industry has by far been affected most by piracy and has gone through the most change as a result. This has a lot to do with music being the most popular type of content to pirate, but also with longstanding flaws in the industry's business model itself. Music piracy is nothing new, dating back to the days of people copying cassette tapes. The industry has long seemed inherently flawed, however, which in part aided the piracy movement. With the introduction of CDs, the music industry underwent a major shift in its pricing practices. Though CDs cost next to nothing to make, the industry marked prices up to exorbitant amounts, expecting that people would simply pay for the higher quality, and for a time they did.

The evolution of the Internet and the introduction of file sharing shattered this model, however. After years of price gouging, many people felt little remorse pirating music, and knowing just how little of the price actually went to the artists, the people they did care about, did little to counteract the trend. One could easily support the artists by going to shows and buying merchandise without lining the pockets of the record industry. All the while this trend escalated, the price of producing music dropped significantly through advances in technology, making the overpricing practices even more of an insult to music listeners.

These factors played a key role in shaping the music industry and its business model. Realizing that people would no longer pay for overpriced CDs and that the popularity of mp3s was starting to exceed that of physical media the music industry started selling digital copies of music at more reasonable prices and even on a song-per-song basis. How many CDs have you bought where you only liked one or two of the songs anyway? The drop in the price of recording has also been a hit to the record labels while at the same time benefiting music as a whole significantly. Small bands could now record their own studio-quality albums on their own instead of relying on labels. They could even distribute albums themselves through the same digital media outlets as the labels, such as iTunes and Amazon, or even press their own CDs since this cost has always been very low anyway.

The industries affected by piracy have had to adapt to survive, but it could easily be argued that any setbacks arose from flaws in the traditional business models to begin with, or from an inability to adapt quickly enough, and in a reasonable manner, to a fast-changing world. The distribution of legal content has likewise been affected, in this case very positively, as frameworks such as torrents are well suited to mass distribution of content regardless of its legality. Open source software and content can easily be distributed through peer-to-peer file sharing without burdening their already struggling producers with the cost of hosting the content themselves. Regardless of one's standpoint on piracy, file sharing has certainly had a huge impact on many industries and on how content is acquired. There is no question that it is here to stay, and embracing and adapting to it is by far the best way to react to this change.

Friday, November 15, 2013

Data Structures: Making Computing Possible

Data is at the heart of modern computing. The majority of what a computer scientist will learn, use, and possibly even research comes down to two things: processing data and storing data. The former is achieved through algorithms and the latter through data structures. Algorithms define the means to achieve particular tasks, generally involving the processing of data, and are quite varied in what they do based on any number of possible problems that need to be solved. Data structures on the other hand, while still just as varied as algorithms, are singular in purpose. All data structures have the common goal of storing and allowing for the access of data, but their variety comes from the types of data and the pursuit of the most efficient ways of storing and accessing that data based on its type and the design of the system.

While there are countless algorithms, and in fact any explicit process for achieving a particular task is considered an algorithm, there are only a relative handful of data structures, and the discovery of a revolutionary new data structure is no small feat or event. Data structures generally involve trade-offs in efficiency based on their particular use as well, making knowing them even more important for programmers. The three main families of data structures are linear structures, trees, and maps.

Linear data structures are, as the name implies, a stringing together of data points along a line. Any piece of data has at most two neighbors, one before and one after. The most common linear structures are arrays and linked lists; in concept both are structured the same, and the efficiency of various operations is the key difference. Arrays allow a particular value to be looked up by position very efficiently, while linked lists allow a new value to be inserted between two existing values very efficiently, and each suffers in performance on the other's specialty. Linear structures can also be nested to create multidimensional structures such as matrices, but ultimately the linear structure remains intact for each component.
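A short sketch of the trade-off, using Python's array-backed list alongside a hand-rolled linked list:

```python
# An array gives constant-time lookup by index; a linked list gives
# constant-time insertion after a node you already hold. Each is slow
# at the other's specialty.

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def insert_after(node, value):
    """O(1): splice a new node in after an existing one."""
    node.next = Node(value, node.next)

def to_list(head):
    """Walk the chain and collect the values."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

head = Node(1, Node(2, Node(4)))
insert_after(head.next, 3)        # insert 3 between 2 and 4 in O(1)
print(to_list(head))              # the chain is now 1, 2, 3, 4

array = [1, 2, 4]
print(array[2])                   # O(1) lookup by index
array.insert(2, 3)                # O(n): later elements must all shift
```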

Trees get their name from the way they branch out from a common point, much like tree branches. They generally involve some method of subdividing the data at each branch to allow much more efficient search operations. This, however, requires that the data have a consistent ordering, so that comparing two values always yields the same result when deciding which branch to take; if such comparisons can't be made, the structure will not work. Trees generally have the advantage of being efficient in both insertion and retrieval, which, if you recall, was a one-or-the-other choice with linear structures. While neither operation is as fast as its superior linear counterpart, the gain from having both be efficient far exceeds the trade-offs of the linear structures.
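A minimal binary search tree sketch makes the role of the ordering concrete; every comparison discards an entire branch:

```python
# Minimal binary search tree: smaller keys go left, larger keys go right,
# so both insert and search skip half the remaining tree (when balanced)
# at each comparison.

class TreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert a key, returning the (possibly new) root; duplicates ignored."""
    if root is None:
        return TreeNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Follow the ordering down the branches to find the key."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
print(search(root, 6))   # present
print(search(root, 7))   # absent
```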

Maps tend to stand alone as data structures in that they fit a format of data that trees and linear structures can't handle directly. They are used when each data value is identified by an associated value, commonly referred to as a key. A key is a small, unique value with data, generally much larger in size, associated to it in such a way that when a particular key is searched for, the associated data value is returned. For example, one could use people's names (provided they are unique) as keys to retrieve more extensive data on each person. Maps can be stored in trees where the ordering is determined by the key, or as hash tables, most easily thought of as a sparsely populated array indexed by hashing the keys, so that a particular key can be looked up and its associated data returned even more efficiently than with the tree counterpart.
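A toy hash map sketch (chaining collisions into small per-slot lists, a simplification of what real implementations do) shows the key-to-data lookup in action:

```python
# Toy hash map: a key is hashed to a slot in a sparsely populated array;
# keys that collide on a slot share a small list (chaining).

class HashMap:
    def __init__(self, slots=8):
        self.buckets = [[] for _ in range(slots)]

    def _bucket(self, key):
        """Pick the slot for a key by hashing it."""
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: replace its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None                   # key absent

people = HashMap()
people.put("Ada Lovelace", {"born": 1815, "field": "mathematics"})
people.put("Alan Turing", {"born": 1912, "field": "computer science"})
print(people.get("Alan Turing"))   # name (the key) retrieves the larger record
```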

The specifics of data structures can be quite complex, and this is only a high-level overview of some of the most common types and features, but they are nonetheless a vital aspect of computer science and worth learning about for anyone interested in programming. Many books and websites on data structures and algorithms exist for those seeking greater detail.