# Surprising statistical patterns on MIT OpenCourseWare's YouTube channel

Posted 8 months ago

The current essay analyzes the number of likes, dislikes, and views on MIT OpenCourseWare’s YouTube channel, where some of MIT’s courses are published. The essay looks at 95 courses. An interesting pattern is that, within a course, a few lectures typically have significantly more likes and views than the rest. In the course 22.01: Introduction to Nuclear Engineering and Ionizing Radiation, for example, the most popular lecture has 185 times as many views as the median! And the number of views $N$ of the $n$-th most popular lecture in a course is described accurately, as measured by $R^2$, by the law $N \approx c + k\,n^{-\gamma}$, where $c$, $k$, and $\gamma$ are constants and $|c| \ll k$.
## Extracting the data

Before analyzing the data, we first need to gather it. This step is routine, so readers interested only in the results can skip this section.

On the technical side, getVideos returns the list of links to all videos in a given playlist, and getVideoData returns the number of views, likes, and dislikes for a given video. Together, they give the statistics for every video in a playlist. Since lecture courses are published as playlists, the two functions yield the number of likes, dislikes, and views (later called the data) for a given course. What then remains is to make a list of courses and run the program.

Note that the two functions below only work as written in Kazakhstan, because they assume that the language of the YouTube page is Kazakh. If you run them from somewhere else (Europe, the US, elsewhere in Asia, etc.), you will need to figure out which keywords to use when parsing the HTML text. This can be done within 30 minutes by opening any video and looking through the HTML source of the YouTube page. If the number of views is 47329, search the HTML for the number 47329. The structure of the page is the same for all videos, so it is possible to deduce a general rule for locating the data.
In[]:= ClearAll[getVideos];
getVideos[playlist_] := Block[
  {
    text = URLRead[playlist]["Body"],
    shift, truncatedText,
    marker = "\"title\":{\"runs\":[{\"text\":", markerLen, markerChars,
    antimarker = "\"YouTube TV\"", antimarkerLen, antimarkerChars,
    record = True, markerPositions, i, iHelper, j
  },
  markerLen = StringLength[marker];
  markerChars = Characters[marker];
  antimarkerLen = StringLength[antimarker];
  antimarkerChars = Characters[antimarker];
  (* drop everything before the first video title *)
  shift = StringPosition[text, marker][[1, 1]] - 1;
  truncatedText = StringJoin@StringPart[text, shift + 1 ;;];
  markerPositions = StringPosition[truncatedText, marker][[All, 2]];
  Table[
    (* titles from the antimarker onwards are not playlist videos *)
    If[StringPart[truncatedText, pos + 1 ;; pos + antimarkerLen] === antimarkerChars, record = False];
    If[record,
      iHelper = False;
      (* advance to the second occurrence of "url" after the title *)
      For[i = pos, True, i++,
        If[StringPart[truncatedText, i ;; i + 2] === {"u", "r", "l"},
          If[iHelper, Break[], iHelper = True]]
      ];
      (* the link ends at the next comma *)
      For[j = i, True, j++,
        If[StringPart[truncatedText, j] === ",", Break[]]
      ];
      "https://www.youtube.com" <> StringJoin[StringPart[truncatedText, i + 6 ;; j - 2]],
      Nothing
    ],
    {pos, markerPositions}
  ]
]

In[]:= ClearAll[getVideoData];
getVideoData[video_] := Block[
  {
    text = URLRead[video]["Body"],
    likesStart, i, dislikesStart, j, viewsStart, k
  },
  (* the Kazakh keywords below precede the like and dislike counts *)
  likesStart = StringPosition[text, "Басқа "][[1, 2]] + 1;
  For[i = likesStart, StringPart[text, i] != " ", i++, Null];
  dislikesStart = StringPosition[text, "басқа "][[1, 2]] + 1;
  For[j = dislikesStart, StringPart[text, j] != " ", j++, Null];
  viewsStart = StringPosition[text, "\"viewCount\":\""][[1, 2]] + 1;
  For[k = viewsStart, StringPart[text, k] != "\"", k++, Null];
  k--;
  (* result: {views, likes, dislikes} *)
  {
    ToExpression@StringJoin@StringPart[text, viewsStart ;; k],
    ToExpression@StringJoin@StringPart[text, likesStart ;; i],
    ToExpression@StringJoin@StringPart[text, dislikesStart ;; j]
  }
]
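The view count is read from a JSON-like key rather than from localized text, so that part of the parsing can be expressed with a string pattern instead of character-by-character loops. The following is only a minimal sketch of that idea: extractViewCount and sampleHtml are hypothetical names that do not appear in the original notebook, and the HTML fragment is invented for illustration.

```wolfram
(* Hypothetical helper, not part of the original notebook *)
ClearAll[extractViewCount];
extractViewCount[html_String] := Module[{matches},
  (* capture the run of digits that follows "viewCount":" *)
  matches = StringCases[
    html,
    "\"viewCount\":\"" ~~ (digits : DigitCharacter ..) ~~ "\"" :> FromDigits[digits],
    1
  ];
  (* return Missing[...] instead of failing when the key is absent *)
  If[matches === {}, Missing["NotFound"], First[matches]]
]

(* invented HTML fragment, for illustration only *)
sampleHtml = "{\"videoDetails\":{\"viewCount\":\"47329\",\"author\":\"x\"}}";
extractViewCount[sampleHtml]   (* 47329 *)
```

Unlike the For loops, the pattern fails visibly (with Missing) when the page layout changes, which may make it easier to notice that the keywords need updating.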
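### List of courses

### Extract data from YouTube channel

Here is the code that extracts the data from the YouTube channel. Each entry of courseList is a pair whose first element is the course name and whose second element is the link to its playlist:

In[]:= ytData = {#[[1]], getVideoData /@ getVideos[#[[2]]]} & /@ courseList;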
Since the extraction takes a very long time to run, I recorded the output it produces and pasted it into the notebook as a literal assignment to ytData, so that the code does not have to be executed every time.
Note that the data above was gathered in January 2021, so by the time you read the essay, the actual numbers may have changed. ytData is a list of the form {..., {name, {lec1, lec2, ... }}, ...}, where name is the name of a course and lec1 represents the first lecture in the course. Each lecture entry, such as lec1, is of the form {views, likes, dislikes}.
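The nested shape of ytData determines every Part expression used later in the essay, so a small stand-in may help. The sketch below uses toyData, an invented list in the same shape; the course names and all numbers are made up and are not taken from the real data.

```wolfram
(* toy data in the same shape as ytData; names and numbers are invented *)
toyData = {
  {"X.001", {{1000, 50, 2}, {400, 20, 1}, {250, 10, 0}}},
  {"X.002", {{300, 9, 1}, {280, 8, 0}}}
};

toyData[[1, 1]]                      (* name of the first course: "X.001" *)
toyData[[1, 2, All, 1]]              (* views of each of its lectures: {1000, 400, 250} *)
toyData[[1, 2, 2]]                   (* {views, likes, dislikes} of its second lecture: {400, 20, 1} *)
Total /@ toyData[[All, 2, All, 1]]   (* total views per course: {1650, 580} *)
```

The same indexing pattern, [[courseIndex, 2, All, 1]], is what selects the view counts of one course from the real ytData.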
## The student-layman model and Zipf’s law

### Introducing the model

The first thing we notice when looking at the number of views for the lectures in a given course is that the numbers are highly uneven. Here, for example, is the data for the course 6.172:

In[]:= ListPlot[ReverseSort@ytData[[11, 2, All, 1]], PlotRange -> All]

Out[]= [plot: number of views against lecture rank for 6.172]

Here, the vertical axis is the number of views and the horizontal axis is the lecture’s rank, with the most popular lecture being first, the second most popular lecture being second, and so on. The top-ranked lecture has 12.5 times as many views as the median. And for the course 22.01, the ratio of the number of views of the most popular lecture to the median number of views is 185!

It turns out that the number of times a lecture is viewed is very accurately described by a power law $N \approx c + k\,n^{-\gamma}$, where $N$ is the number of views, $n$ is the lecture’s rank, and $c$, $k$, $\gamma$ are constants. For 6.172, the fit is $N \approx -2.9\cdot 10^{3} + k\,n^{-0.853}$ with $k$ of order $10^{4}$:

In[]:= courseCKGModelPlot[11]

Out[]= [plot of the fitted model for 6.172]

The coefficients $c$, $k$, $\gamma$ and the fraction of variance unexplained (FVU), which gives $R^2 = 1 - \mathrm{FVU}$, can be computed for every course:

In[]:= {courseslist, coursesklist, coursesγlist, coursesFVUlist} = Transpose[courseCKGModelFit /@ Range[Length[ytData]]];

In[]:= Histogram[1 - coursesFVUlist, AxesLabel -> {"R²", ""}]

Out[]= [histogram of R² over all courses]

In[]:= Median[1 - coursesFVUlist]

Out[]= 0.973113

The median value of $R^2$ is about 0.97, so the power law fits most courses closely.
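The source of courseCKGModelFit is not shown, so the following is only a guess at how such a fit could be done, not the author's implementation. The sketch assumes NonlinearModelFit with rough starting values and computes FVU as the residual sum of squares divided by the total sum of squares; ckgFitSketch and syntheticViews are hypothetical names, and the test data is synthetic.

```wolfram
(* hypothetical stand-in for the notebook's courseCKGModelFit, whose source is not shown *)
ClearAll[ckgFitSketch];
ckgFitSketch[views_List] := Module[{sorted, data, fit, c, k, g, n, fvu},
  sorted = ReverseSort[views];                        (* rank 1 = most viewed *)
  data = Transpose[{Range[Length[sorted]], sorted}];  (* {rank, views} pairs *)
  (* starting values are an assumption: c near 0, k near the top view count, g near 1 *)
  fit = NonlinearModelFit[data, c + k n^(-g),
    {{c, 0}, {k, Max[sorted]}, {g, 1}}, n];
  (* fraction of variance unexplained; R^2 = 1 - fvu *)
  fvu = Total[fit["FitResiduals"]^2]/Total[(sorted - Mean[sorted])^2];
  {c, k, g, fvu} /. fit["BestFitParameters"]
]

(* synthetic views that follow the law exactly, with c = 500, k = 20000, g = 0.9 *)
syntheticViews = N@Table[500 + 20000 n^(-0.9), {n, 1, 30}];
ckgFitSketch[syntheticViews]
```

On this synthetic input the result should be close to {500, 20000, 0.9, 0}. The output order {c, k, γ, FVU} is chosen to match the four lists that the essay obtains with Transpose.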
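### Supplemental code

### Interesting features

Things become even more interesting if we plot the histogram of the parameter $\gamma$ in the approximate law $N \approx c + k\,n^{-\gamma}$:

In[]:= Histogram[coursesγlist, AxesLabel -> {"γ", ""}]

Out[]= [histogram of γ over all courses]

In[]:= {Mean[coursesγlist], Median[coursesγlist]}

Out[]= {0.976588, 0.967773}

This is a unimodal distribution centered at 1! This is exactly what we would expect from Zipf’s law, which tells us that the number of views of the $n$-th most popular lecture is inversely proportional to $n$, that is, $N \approx k/n$. Now consider the ratio $c/k$:

In[]:= Histogram[courseslist/coursesklist, AxesLabel -> {"c/k", ""}]

Out[]= [histogram of c/k over all courses]

Most of the time, the ratio $c/k$ is small in magnitude, that is, $|c| \ll k$.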
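One way to see the link between the fitted law and Zipf’s law is to take logarithms. The lines below are a short worked derivation, assuming $N > c$ so that the logarithm is defined; they are a sketch and not part of the original notebook.

```latex
\begin{align}
N - c &\approx k\,n^{-\gamma} \\
\log(N - c) &\approx \log k - \gamma \log n
\end{align}
```

On log-log axes, the points $(n, N - c)$ should therefore fall near a straight line with slope $-\gamma$. For the top-ranked lectures, where $k\,n^{-\gamma}$ dominates $c$, setting $\gamma \approx 1$ reduces the law to $N \approx k/n$, which is Zipf’s law.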
### Interpretation

The number of views a lecture has can be represented as a sum of two components: students, who pedantically watch all the lectures in order, and “laymen”, who click on the lectures either accidentally or because of their catchy names. The number of student views stays constant from lecture to lecture, while the number of laymen views changes. Comparing that to the empirical law $N \approx c + k\,n^{-\gamma}$, the constant term $c$ plays the role of the student views, while the rank-dependent term $k\,n^{-\gamma}$ plays the role of the laymen views.

This interpretation can be tested by looking at how a course’s $\gamma$ relates to its content. In the list below, courses are sorted by their $\gamma$, with the largest coming first:

In[]:= ytData[[#, 1]] & /@ (Reverse@Ordering@coursesγlist)

Out[]= {14.73,14.772,20.010J,16.01,6.832,15.S50,2.830J,8.962,22.01,5.112,6.01SC,18.650,16.687,20.219,18.217,9.00SC,6.042J,3.091,6.0002,6.450,6.172,15.S08,8.04,6.851,5.60,9.40,7.013,3.021J,6.858,18.S096,6.001,15.S12,6.262,6.S897,6.013,11.601,8.851,10.34,6.002,8.286,8.421,3.60,6.849,8.334,8.422,5.95J,6.451,18.086,2.003SC,7.012,6.890,18.01,8.333,8.821,14.01SC,8.05,6.033,8.591J,5.80,2.71,5.07SC,6.172,3.320,6.006,5.61,7.016,6.034,6.189,5.08J,6.041,18.085,8.03SC,9.04,18.065,18.03,6.868J,2.627,16.885J,6.003,6.02,18.02,15.401,18.03SC,3.054,7.014,15.031J,16.842,4.696,7.01SC,6.0001,6.046J,7.91J,18.06,5.111,2.003J}

The course with the highest $\gamma$ is 14.73, The Challenge of World Poverty; the next course also belongs to major 14: 14.772, Development Economics: Macroeconomics (it seems like there is something special about the major). The third course is 20.010J, Introduction to Bioengineering. All three courses have catchy names that can be exciting and attractive to a wide audience. Now look at the course 2.003J, whose view count is the most uniform. Its name is Dynamics and Control I, which means nothing to an average person. The next course is 5.111, which has the modest name Introduction to Chemical Science.
A possible explanation for this phenomenon in the framework of the student-layman model is that people are attracted by the catchy names of some courses but later leave because the courses did not meet their high expectations. Courses with modest names, on the other hand, are watched by people who know what they are looking for and hence will not change their mind in the middle of the course. And can we use the interpretation to explain why the approximate law is $N \approx c + k\,n^{-\gamma}$ rather than, say, $N \approx c + k\,e^{-\gamma n}$?
## Comparing courses

### Max to median views

The previous section looked at the unevenness in the number of views of the lectures. Continuing on this topic, let us compare the courses by how the number of views of their most popular lecture relates to the median number of views. In other words, let us compare the courses by Max[views]/Median[views]:

In[]:= ClearAll[getMaxToMedianViews];
getMaxToMedianViews[courseData_] := Block[{views = courseData[[2, All, 1]]}, Max[views]/Median[views]];

In[]:= coursesByMaxToMedianViews = Reverse@Ordering[getMaxToMedianViews /@ ytData]

Out[]= {10,27,75,4,58,2,66,87,17,3,57,69,90,82,41,80,37,64,52,63,71,89,8,51,29,60,53,62,73,74,93,38,65,6,7,14,13,16,21,50,67,92,77,36,81,35,86,28,19,68,9,39,76,25,15,46,32,34,5,11,94,47,20,12,85,30,79,22,84,1,59,61,49,40,83,78,31,56,91,45,26,42,88,44,54,55,24,18,33,70,95,48,23,72,43}

In[]:= ytData[[#, 1]] & /@ coursesByMaxToMedianViews

Out[]= {22.01,20.219,6.001,18.217,14.73,3.091,6.01SC,5.112,18.650,8.962,6.006,6.832,16.01,3.60,8.04,5.60,6.849,14.01SC,8.851,18.03SC,2.830J,20.010J,16.687,6.851,6.858,6.262,2.003SC,6.042J,5.80,6.450,18.03,8.422,9.00SC,9.40,6.S897,5.08J,5.61,10.34,5.07SC,14.772,6.172,7.012,18.085,8.333,18.086,18.S096,6.451,6.890,11.601,2.71,15.S12,8.286,6.189,8.821,8.03SC,6.034,8.421,8.334,7.016,6.172,6.002,2.627,6.0002,18.065,6.013,8.591J,18.01,6.0001,6.033,15.S08,6.041,7.01SC,7.013,8.05,3.320,18.02,15.S50,6.003,16.885J,6.868J,6.046J,9.04,7.014,6.02,15.401,15.031J,3.054,16.842,7.91J,5.95J,18.06,3.021J,5.111,4.696,2.003J}

As before, 2.003J: Dynamics and Control I comes out as the course with the most homogeneous stats. The most uneven course is now 22.01: Introduction to Nuclear Engineering and Ionizing Radiation. Similarly to the result we got by comparing $\gamma$ …
### Total views

Let’s now assess the courses’ popularity by comparing their total number of views:

In[]:= ClearAll[getTotalViews];
getTotalViews[courseData_] := Total@courseData[[2, All, 1]];

In[]:= coursesByViews = Reverse@Ordering[getTotalViews /@ ytData];

In[]:= ytData[[#, 1]] & /@ coursesByViews

Out[]= {6.006,18.06,18.01,18.03SC,6.034,18.02,18.S096,8.04,18.03,14.01SC,7.01SC,5.60,6.042J,2.003SC,6.041,6.002,6.0001,6.0002,15.401,6.046J,9.00SC,6.001,18.085,18.650,18.065,22.01,5.111,6.858,6.01SC,6.003,15.S50,7.012,8.05,7.014,14.73,5.112,6.450,8.333,18.086,16.885J,2.627,8.286,7.91J,6.262,6.851,7.013,6.033,6.868J,8.03SC,3.60,15.S12,6.451,6.849,16.842,5.07SC,8.962,7.016,8.422,2.71,6.189,6.832,8.591J,9.04,6.02,15.031J,5.61,6.172,6.013,8.421,8.821,16.687,20.010J,6.890,8.334,3.054,2.830J,6.172,14.772,3.320,10.34,2.003J,8.851,3.021J,16.01,5.08J,18.217,5.95J,5.80,9.40,20.219,6.S897,15.S08,11.601,3.091,4.696}

The most popular course is 6.006: Introduction to Algorithms, and the next most popular one is 18.06: Linear Algebra. These courses are relevant in fields like machine learning and IT, as well as many other industries, which is probably why they are so popular. Notice that the first seven positions are all taken by math and computer science courses, the fields most in demand in industry. The first popular courses outside math and CS are 8.04: Quantum Physics I and 14.01SC: Principles of Microeconomics.

The three least popular courses are 4.696: A Global History of Architecture, 3.091: Introduction to Solid State Chemistry, and 11.601: Introduction to Environmental Policy and Planning. These are courses with a very specific target audience and comparatively limited practical application (apart from 3.091, whose low view count is surprising). It is interesting that 6.S897, a computer science course, is the fifth least popular course, as CS is the most popular major. The explanation is probably that the course is very specific: its name is Machine Learning for Health Care, which is only going to attract a few narrow specialists.

All in all, majors relevant in industry, such as computer science and math, are the most popular, while courses with a narrow target audience and comparatively limited practical applications get the fewest views.

In[]:= {BarChart[getTotalViews@ytData[[#]] & /@ coursesByViews[[;; 7]],
   ChartLabels -> (ytData[[#, 1]] & /@ coursesByViews[[;; 7]]),
   BarSpacing -> 0.5, PlotLabel -> "Top 7 courses by total views", ImageSize -> Medium],
  BarChart[getTotalViews@ytData[[#]] & /@ coursesByViews[[-7 ;;]],
   ChartLabels -> (ytData[[#, 1]] & /@ coursesByViews[[-7 ;;]]),
   BarSpacing -> 0.5, PlotLabel -> "Bottom 7 courses by total views", ImageSize -> Medium]}

Out[]= [two bar charts: the seven most viewed and the seven least viewed courses]

In[]:= Histogram[getTotalViews /@ ytData, 40, PlotLabel -> "Total number of views",
  PlotRange -> {{0, 2*10^6}, All}]  (* the vertical range is missing from the source; All is an assumption *)

Out[]= [histogram of total views per course]
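The rankings in this section all rely on the same idiom, Reverse@Ordering[...], which turns a list of scores into course indices sorted from highest to lowest. The sketch below walks through it on invented numbers; toyNames and toyTotals are hypothetical and unrelated to the real data.

```wolfram
(* invented totals for three hypothetical courses *)
toyNames = {"X.001", "X.002", "X.003"};
toyTotals = {1650, 580, 9000};

(* Ordering returns the positions that would sort the list in ascending order *)
ascending = Ordering[toyTotals];      (* {2, 1, 3} *)

(* reversing gives positions from most viewed to least viewed *)
byViews = Reverse[ascending];         (* {3, 1, 2} *)

toyNames[[byViews]]                   (* {"X.003", "X.001", "X.002"} *)
```

Working with positions rather than sorted values is what lets the essay reuse one ranking, such as coursesByViews, to look up both the course names and their totals.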
### Likes to dislikes

Having assessed the courses’ popularity, it would now be interesting to evaluate how enjoyable they are by comparing their total number of likes to their total number of dislikes. The idea is that the more exciting the course, the higher the ratio of likes to dislikes. There are of course some issues with this metric (for example, since there are far more laymen than students, the data will mostly reflect the preferences of laymen), but it is still a fairly reasonable way to measure how exciting the course’s content is. So, here is the code:

In[]:= ClearAll[getTotalLikes];getTotalLikes[courseData_] := Total@courseData[[2, All, 2]];ClearAll[getTotalDislikes];getTotalDislikes[c |