
Table of contents

3.1.1.Storing Single Values: Integers, Floating-Point Numbers, Booleans
3.1.2.Storing Text
3.1.3.Combining Multiple Values: Lists, Vectors, And Friends
3.1.4.Dictionaries
3.1.5.From One to More Dimensions: Matrices and $n$-Dimensional Arrays
3.1.6.Making Life Easier: Data Frames
3.2.Simple Control Structures: Loops and Conditions
3.2.1.Loops
3.2.2.Conditional Statements
3.3.Functions and Methods
16.2.Where To Go Next?
16.3.Open, Transparent, and Ethical Computational Science

Abstract This chapter introduces readers to the basics of programming, data types, control structures, and functions in Python and R. It explains how to deal with objects, statements, expressions, variables and different types of data, and shows how to create and understand simple control structures such as loops and conditions.

Keywords: basics of programming

Chapter objectives:

Understand objects and data types
Write control structures
Use functions and methods

Packages used in this chapter
This chapter focuses on the built-in capabilities of Python and R, so it does not rely on many packages. For R, only glue is used (which allows nice text formatting). For Python, we only use the packages numpy and pandas for data frame support. If needed, you can install these packages with the code below (see Section 1.4 for more details).

Python code

!pip3 install numpy pandas

R code

install.packages("glue")

After installing, you need to import (activate) the packages every session:

Python code

import numpy as np
import pandas as pd

Now that you have seen what R and Python can do in Chapter 2, it is time to take a small step back and learn more about how it all actually works under the hood.

In both languages, you write a script or program containing the commands for the computer. But before we get to some real programming and exciting data analyses, we need to understand how data can be represented and stored.

No matter whether you use R or Python, both store your data in memory as objects. Each of these objects has a name, and you create them by assigning a value to a name. For example, the command x=10 creates a new object[1], named x, and stores the value 10 in it. This object is now stored in memory and can be used in later commands. Objects can be simple values such as the number 10, but they can also be pieces of text, whole data frames (tables), or analysis results. We call this distinction the type or class of an object.

Objects, pointers, and variables. In programming, a distinction is often made between an object (such as the number 10) and the variable in which it is stored (such as x). The latter is also called a “pointer”. However, this distinction is not very relevant for most of our purposes. Moreover, in statistics, the word variable often refers to a column of data, rather than to the name of, for instance, the object containing the whole data frame (or table). For that reason, we will use the word object to refer to both the actual object or value and its name. (If you want some extra food for thought and want to challenge your brain a bit, try to see the relationship between the idea of a pointer and the discussion about mutable and immutable objects below.)

Let us create an object that we call a (an arbitrary name, you can use whatever you want), assign the value 100 to it, and use the class function (R) or type function (Python) to check what kind of object we created (Example 3.1). As you can see, R reports the type of the number as “numeric”, while Python reports it as “int”, short for integer or whole number. Although they use different names, both languages offer very similar data types. Table 3.1 provides an overview of some common basic data types.

Python code

a = 100
print(type(a))

R code

a = 100
print(class(a))

Python output

<class 'int'>

R output

[1] "numeric"

Table 3.1. Common basic data types in Python and R

Python name	Python example	R name	R example	Description
int	1	integer	1L	whole numbers
float	1.3	numeric	1.3	numbers with decimals
str	"Spam", 'ham'	character	"Spam", 'ham'	textual data
bool	True, False	logical	TRUE, FALSE	truth values

Let us have a closer look at the code in Example 3.1 above. The first line is a command to create the object a and store the value 100 in it; the second is illustrative and gives you the class of the created object, in this case “numeric”. Notice that we are using two native functions of R, print and class: a is passed as an argument to class, and class(a) in turn is passed as an argument to print. The only difference between R and Python here is that the relevant Python function is called type instead of class.

Once created, you can now perform multiple operations with a and other values or new variables as shown in Example 3.2. For example, you could transform a by multiplying a by 2, create a new variable b of value 50 and then create another new object c with the result of a + b.

Python code

a = 100
a = a*2  # equivalent to (shorter) a*=2
b = 50
c = a + b
print(a, b, c)

R code

a = 100
a = a*2
b = 50
c = a + b
print(a)
print(b)
print(c)

R output

[1] 200
[1] 50
[1] 250

3.1.1.Storing Single Values: Integers, Floating-Point Numbers, Booleans

When working with numbers, we distinguish between integers (whole numbers) and floating point numbers (numbers with a decimal point, called “numeric” in R). Both Python and R automatically determine the data type when creating an object, but differ in their default behavior when storing a number that can be represented as an int: R will store it as a float anyway, and you need to force it to do otherwise; for Python it is the other way round (Example 3.3). We can also convert between types later on, although converting a float to an int is often not a good idea, because it truncates the decimal part of your data.

So why not just always use a float? First, floating point operations usually take more time than integer operations. Second, because floating point numbers are stored as a combination of a coefficient and an exponent (to the base of 2), many decimal fractions can only approximately be stored as a floating point number. Except for specific domains (such as finance), these inaccuracies are often not of much practical importance. But it explains why calculating 6*6/10 in Python returns 3.6, while 6*0.6 or 6*(6/10) returns 3.5999999999999996. Therefore, if a value can logically only be a whole number (anything that is countable, in fact), it makes sense to restrict it to an integer.
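The rounding behavior described above can be checked directly. The following is a minimal sketch; comparing floats with math.isclose is a suggestion of ours rather than something this chapter prescribes, and n_tweets is a hypothetical count used only for illustration.

```python
import math

# Integer multiplication first, then a single division
print(6 * 6 / 10)
# 0.6 cannot be stored exactly in binary, so a tiny error surfaces here
print(6 * 0.6)

# Exact comparison of floats is therefore fragile ...
print(6 * 0.6 == 3.6)
# ... while comparing with a small tolerance is more robust
print(math.isclose(6 * 0.6, 3.6))

# Counts stay exact when they are stored as integers (hypothetical value)
n_tweets = 3 * 12
print(n_tweets, type(n_tweets))
```

In practice this means that you should avoid testing two floats for exact equality and keep anything countable as an int.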

We also have a data type that is even more restricted and can take only two values: true or false. It is called “logical” (R) or “bool” (Python). Just notice that boolean values are case sensitive: while in R you must capitalize the whole value (TRUE, FALSE), in Python we only capitalize the first letter: True, False. As you can see in Example 3.3, such an object behaves exactly as an integer that is only allowed to be 0 or 1, and it can easily be converted to an integer.

Python code

d = 20
print(type(d))
# forcing python to treat 20 as a float
d2 = 20.0
print(type(d2))
e = int(20.7)
print(type(e))
print(e)
f = True
print(type(f))
print(int(f))
print(int(False))

R code

d = 20
print(class(d))
# forcing R to treat 20 as an int
d2 = 20L
print(class(d2))
e = as.integer(20.7)
print(class(e))
print(e)
f = TRUE
print(class(f))
print(as.integer(f))
print(as.integer(FALSE))

Python output

<class 'int'>
<class 'float'>
<class 'int'>
20
<class 'bool'>
1
0

R output

[1] "numeric"
[1] "integer"
[1] "integer"
[1] 20
[1] "logical"
[1] 1
[1] 0

3.1.2.Storing Text

As a computational analyst of communication you will usually work with text objects or strings of characters. Commonly simply known as “strings”, such text objects are also referred to as “character vector objects” in R. Every time you want to analyze a social-media message, or any other text, you will be dealing with such strings.

Python code

text1 = "This is a text"
print(f"Type of text1: {type(text1)}")
text2 = "Using 'single' and \"double\" quotes"
text3 = 'Using \'single\' and "double" quotes'
print(f"Are text2 and text3 equal? {text2==text3}")

R code

text1 = "This is a text"
glue("Class of text1: {class(text1)}")
text2 = "Using 'single' and \"double\" quotes"
text3 = 'Using \'single\' and "double" quotes'
glue("Are text2 and text3 equal? {text2==text3}")

Python output

Type of text1: <class 'str'>
Are text2 and text3 equal? True

R output

Class of text1: character
Are text2 and text3 equal? TRUE

Python code

somebytes = text1.encode("utf-8")
print(type(somebytes))
print(somebytes)

R code

somebytes = charToRaw(text1)
print(class(somebytes))
print(somebytes)

Python output

<class 'bytes'>
b'This is a text'

R output

[1] "raw"
 [1] 54 68 69 73 20 69 73 20 61 20 74 65 78 74

As you see in Example 3.4, you can create a string by enclosing text in quotation marks. You can use either double or single quotation marks, but you need to use the same mark to begin and end the string. This is useful if you want to use quotation marks within a string: you can then use the other type to denote the beginning and end of the string. If you need to use a single quotation mark within a single-quoted string, you can escape the quotation mark by prepending it with a backslash (\'), and similarly for double-quoted strings. To include an actual backslash in a text, you also escape it with a backslash, so you end up with a double backslash (\\).

The Python example also shows a concept introduced in Python 3.6: the f-string. These are strings that are prefixed with the letter f and are formatted strings. This means that these strings will automatically insert a value where curly brackets indicate that you wish to do so. This means that you can write: print(f"The value of i is {i}") in order to print “The value of i is 5” (given that i equals 5). In R, the glue package allows you to use an f-string-like syntax as well: glue("The value of i is {i}").
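As a small sketch of the f-string behavior just described: the variable score and its value are hypothetical, and the format specification after the colon goes slightly beyond what the text above covers.

```python
i = 5
score = 7.268  # hypothetical value, for illustration only

# The value of i is inserted where the curly brackets are
message = f"The value of i is {i}"
print(message)

# Any expression can go between the curly brackets
print(f"Twice i is {i * 2}")

# A format specification after a colon controls how the value is displayed
print(f"Score rounded to one decimal: {score:.1f}")
```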

Although this will be explained in more detail in Section 5.2.2 9.1, it is good to introduce how computers store text in memory or files. It is not too difficult to imagine how a computer internally handles integers: after all, even though the number may be displayed as a decimal number to us, it can be trivially converted and stored as a binary number (effectively, a series of zeros and ones); we do not have to care about that. But when we think about text, it is not immediately obvious how a string should be stored as a sequence of zeros and ones, especially given the huge variety of writing systems used for different languages.

Indeed, there are several ways in which textual characters can be stored as bytes, which are called encodings. The process of moving from bytes (numbers) to characters is called decoding, and the reverse process is called encoding. Ideally, this is not something you should need to think about, and indeed strings (or character vectors) already represent decoded text. This means that often when you read from or write data to a file, you need to specify the encoding (usually UTF-8). However, both Python and R also allow you to work with the raw data (i.e. before decoding) in the form of bytes (Python) or raw (R) data, which is sometimes necessary if there are encoding problems. This is shown briefly in the bottom part of Example 3.4. Note that while R shows the underlying hexadecimal byte values of the raw data (so 54 is T, 68 is h and so on) and Python displays the bytes as text characters, in both cases the underlying data type is the same: raw (non-decoded) bytes.
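To illustrate the difference between characters and bytes, the following sketch encodes a short string containing a non-ASCII character and decodes it again. The example string is our own choice; Latin-1 is used only as an example of a mismatching encoding.

```python
text = "café"  # hypothetical example string with one non-ASCII character

# Encoding: characters -> bytes
raw = text.encode("utf-8")
print(type(raw), raw)

# The string has 4 characters, but UTF-8 needs 5 bytes to store it
print(len(text), len(raw))

# Decoding: bytes -> characters (round trip gives back the original)
print(raw.decode("utf-8") == text)

# Decoding with the wrong encoding does not fail here, but garbles the text
garbled = raw.decode("latin-1")
print(garbled)
```

Garbled characters of this kind in your data are often a sign that a file was read with the wrong encoding.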

3.1.3.Combining Multiple Values: Lists, Vectors, And Friends

Until now, we have focused on the basic, initial data types or “vector objects”, as they are called in R. Often, however, we want to group a number of these objects. For example, we do not want to manually create thousands of objects called tweet0001, tweet0002, …, tweet9999 – we'd rather have one list called tweets that contains all of them. You will encounter several names for such combined data structures: lists, vectors, arrays, series, and more. The core idea is always the same: we take multiple objects (be it numbers, strings, or anything else) and then create one object that combines all of them (Example 3.5).

Python code

scores = [8, 8, 7, 6, 9, 4, 9, 2, 8, 5]
print(type(scores))
countries = ["Netherlands", "Germany", "Spain"]
print(type(countries))

R code

scores = c(8, 8, 7, 6, 9, 4, 9, 2, 8, 5)
print(class(scores))
countries = c("Netherlands", "Germany", "Spain")
print(class(countries))

Python output

<class 'list'>
<class 'list'>

R output

[1] "numeric"
[1] "character"

As you see, we now have one name (such as scores) to refer to all of the scores. The Python object in Example 3.5 is called a list, the R object a vector. There are more such combined data types, which have slightly different properties that can be important to know about: first, whether you can mix different types (say, integers and strings); second, what happens if you change the array. We will discuss both points below and show how this relates to different specific types of arrays in Python and R which you can choose from. But first, we will show how to work with them.

Operations on vectors and lists. One of the most basic operations you can perform on all types of one-dimensional arrays is indexing. It lets you locate any given element or group of elements within a vector using its or their positions. The first item of a vector in R is called 1, the second 2, and so on; in Python, we begin counting with 0. You can retrieve a specific element from a vector or list by simply putting the index between square brackets [] (Example 3.6).

Python code

scores = ["8","8","7","6","9","4","9","2","8","5"]
print(scores[4])
print([scores[0], scores[9]])
print(scores[0:4])
# Convert the first 4 scores into numbers
# Note the use of a list comprehension [.. for ..]
# This will be explained in the section on loops
scores_new = [int(e) for e in scores[0:4]]
print(type(scores_new))
print(scores_new)

R code

scores=c("8","8","7","6","9","4","9","2","8","5")
scores[5]
scores[c(1, 10)]
scores[1:4]
# Convert the first 4 scores into numbers
scores_new = as.numeric(scores[1:4])
class(scores_new)
scores_new

Python output

9
['8', '5']
['8', '8', '7', '6']
<class 'list'>
[8, 8, 7, 6]

R output

[1] "9"
[1] "8" "5"
[1] "8" "8" "7" "6"
[1] "numeric"
[1] 8 8 7 6

In the first case, we asked for the score of the 5th student ("9"); in the second we asked for the 1st and 10th position ("8" "5"); and finally for all the elements between the 1st and 4th position ("8" "8" "7" "6"). We can directly indicate a range by using a :. After the colon, we provide the index of the last element (in R), while Python stops just before the index.[2] If we want to pass multiple single index values instead of a range in R, we need to create a vector of these indices by using c() (Example 3.6). Take a moment to compare the different ways of indexing between Python and R in Example 3.6!
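Because Python stops just before the end index, slices behave in a few convenient ways. The following minimal sketch reuses the scores list from Example 3.6; the use of omitted bounds and negative indices is an addition of ours that goes slightly beyond the text.

```python
scores = ["8", "8", "7", "6", "9", "4", "9", "2", "8", "5"]

# Positions 0, 1, 2, 3 are included; position 4 is excluded
first_four = scores[0:4]
print(first_four)
# The length of a slice is simply stop minus start
print(len(first_four))

# Omitting a bound means "from the start" or "until the end",
# so the two halves always fit together without overlap
head, tail = scores[:4], scores[4:]
print(head + tail == scores)

# Negative indices count from the end of the list
print(scores[-1])
```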

Indexing is very useful to access elements and also to create new objects from a part of another one. The last line of our example shows how to create a new array with just the first four entries of scores and store them all as numbers. To do so, we use slicing to get the first four scores and then either change its class using the function as.numeric (in R) or convert the elements to integers one-by-one (Python) (Example 3.6).

Python code

# Appending a new value to a list:
scores.append(7)
# Create a new list instead of overwriting:
scores4 = scores + [7]
# Removing an entry (negative indices count from the end of the list):
del scores[-10]
# Creating a list containing various ranges
list(range(1,21))
list(range(-5,6))
# A range of fractions: 0, 0.2, 0.4, ... 1.0
# Because range only handles integers, we first
# make a range of 0, 2, etc, and divide by 10
my_sequence = [e/10 for e in range(0,11,2)]

R code

# appending a new value to a vector
scores = c(scores, 7)
# Create a new vector instead of overwriting:
scores4 = c(scores, 7)
# removing an entry from a vector (a negative index excludes that position)
scores = scores[-10]
# Creating a vector containing various ranges
range1 = 1:20
range2 = -5:5
# A range of fractions: 0, 0.2, 0.4, ... 1.0
my_sequence = seq(0,1, by=0.2)

We can do many other things like adding or removing values, or creating a vector from scratch by using a function (Example 3.7). For instance, rather than just typing a large number of values by hand, we often might wish to create a vector from an operator or a function, without typing each value. Using the operator : (R) or the functions seq (R) or range (Python), we can create numeric vectors with a range of numbers.

Can we mix different types? There is a reason that the basic data types (numeric, character, etc.) we described above are called “vector objects” in R: The vector is a very important structure in R and consists of these objects. A vector can be easily created with the c function and can only combine elements of the same type (numeric, integer, complex, character, logical, raw). Because the data types within a vector correspond to only one class, when we create a vector with for example numeric data, the class function will display “numeric” and not “vector”.

If we try to create a vector with two different data types, R will force some elements to be transformed, so that all elements belong to the same class. For example, if you re-build the vector of scores with a new student who has been graded with the letter b instead of a number (Example 3.8), your vector will become a character vector. If you print it, you will see that the values are now displayed surrounded by ".

R code

scores2 = c(8, 8, 7, 6, 9, 4, 9, 2, 8, 5, "b")
print(class(scores2))
print(scores2)

R output. Note that Python output may look slightly different

[1] "character"
 [1] "8" "8" "7" "6" "9" "4" "9" "2" "8" "5" "b"

In contrast to a vector, a list is much less restricted: a list does not care whether you mix numbers and text. In Python, such lists are the most common type for creating a one-dimensional array. Because they can contain very different objects, running the type function on them does not return anything about the objects inside the list, but simply states that we are dealing with a list (Example 3.5). In fact, lists can even contain other lists, or any other object for that matter.

In R you can also use lists, even though they are much less popular in R than they are in Python, because vectors are better if all objects are of the same type. R lists are created in a similar way to vectors, except that we use the function list instead of c to declare the values. Let us build a list with four different kinds of elements: a numeric object, a character object, a square root function (sqrt), and a numeric vector (Example 3.9). In fact, you can use any of the elements in the list through indexing, even the function sqrt that you stored in there, to get the square root of 16!

Python code

my_list = [33, "Twitter", np.sqrt, [1,2,3,4]]
print(type(my_list))
# this resolves to sqrt(16):
print(my_list[2](16))

R code

my_list = list(33, "Twitter", sqrt, c(1,2,3,4))
class(my_list)
# this resolves to sqrt(16):
my_list[[3]](16)

Python output

<class 'list'>
4.0

R output

[1] "list"
[1] 4

Python users often like the fact that lists give a lot of flexibility, as they happily accept entries of very different types. But Python users, too, may sometimes want a stricter structure like R's vector. This may be especially interesting for high-performance calculations, and therefore such a structure is available from the numpy (which stands for Numerical Python) package: the numpy array. This will be discussed in more detail when we deal with data frames in Chapter 5.

Object references and mutable objects. A subtle difference between Python and R is how they deal with copying objects. Suppose we define $x$ containing the numbers $1,2,3$ (x=[1,2,3] in Python or x=c(1,2,3) in R) and then define an object $y$ to equal $x$ (y=x). In R, both objects are kept separate, so changing $x$ does not affect $y$, which is probably what you expect. In Python, however, we now have two variables (names) that both point to or reference the same object, and if we change $x$ we also change $y$ and vice versa, which can be quite unexpected. Note that if you really want to copy an object in Python, you can run x.copy(). See Example 3.10 for an example. Note that this is only important for mutable objects, that is, objects that can be changed. For example, lists in Python and R and vectors in R are mutable because you can replace or append members. Strings and numbers, on the other hand, are immutable: you cannot change a number or string. Instead, a statement such as x=x*2 creates a new object containing the value of x*2 and stores it under the name x.

Python code

x = [1,2,3]
y = x
y[0] = 99
print(x)

R code

x = c(1,2,3)
y = x
y[1] = 99
print(x)
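The following sketch extends the Python half of Example 3.10 to contrast plain assignment with x.copy(). The names z, a, and b are hypothetical and only serve the illustration.

```python
x = [1, 2, 3]
y = x          # y is a second name for the same list object
z = x.copy()   # z is a new, independent list with the same content

y[0] = 99
print(x)       # x has changed as well, because y and x are the same object
print(z)       # the copy is unaffected
print(y is x, z is x)

# With immutable values, "changing" a name just rebinds it to a new object
a = 10
b = a
a = a * 2
print(a, b)
```

Note that x.copy() makes a shallow copy: if the list itself contains other lists, those inner lists are still shared between the original and the copy.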

Sets and Tuples. The vector (R) and list (Python) are the most frequently used collections for storing multiple objects. In Python there are two more collection types you are likely to encounter. First, tuples are very similar to lists, but they cannot be changed after creating them (they are immutable). You can create a tuple by replacing the square brackets with regular parentheses: x=(1,2,3).
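A brief sketch of what immutability means for tuples in practice; the name x2 is hypothetical, and the try/except construct used to catch the error is not introduced in this chapter.

```python
x = (1, 2, 3)
print(type(x))

# Indexing and slicing work just as they do for lists
print(x[0], x[1:])

# Assigning to a position fails, because tuples cannot be changed
try:
    x[0] = 99
except TypeError as err:
    print(f"TypeError: {err}")

# To "change" a tuple, build a new one from the old parts instead
x2 = (99,) + x[1:]
print(x2)
```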

Second, in Python there is an object type called a set. A set is a mutable collection of unique elements (you cannot repeat a value) with no order. As it is not properly ordered, you cannot run any indexing or slicing operation on it. Although R does not have an explicit set type, it does have functions for the various set operations, the most useful of which is probably the function unique which removes all duplicate values in a vector. Example 3.11 shows a number of set operations in Python and R, which can be very useful, e.g. finding all elements that occur in two lists.

Python code

a = {3, 4, 5}
my_list = [3, 2, 3, 2, 1]
b = set(my_list)
print(f"Set a: {a}; b: {b}")
print(f"intersect: a & b = {a & b}")
print(f"union: a | b = {a | b}")
print(f"difference: a - b = {a - b}")

R code

a = c(3, 4, 5)
my_vector = c(3, 2, 3, 2, 1)
b = unique(my_vector)
print(b)
print(intersect(a,b))
print(union(a,b))
print(setdiff(a,b))

Python output

Set a: {3, 4, 5}; b: {1, 2, 3}
intersect: a & b = {3}
union: a | b = {1, 2, 3, 4, 5}
difference: a - b = {4, 5}

R output

[1] 3 2 1
[1] 3
[1] 3 4 5 2 1
[1] 4 5

3.1.4.Dictionaries

Python dictionaries are a very powerful and versatile data type. A dictionary is an unordered[3] and mutable collection in which each object (the value) is stored under another object (the key). Python represents this data type as {key : value} pairs, so that you retrieve an object by its key and not by its relative position in the collection. Unlike a list, which you index with an integer denoting the position, you index a dictionary using the key. This is shown in Example 3.12, in which we retrieve the value stored under the key “positive” in the dictionary sentiments and the value stored under the key “A” in the dictionary grades. You will find dictionaries very useful in your journey as a computational scientist or practitioner, since they are flexible ways to store and retrieve structured information. We can create them using the curly brackets {} and including each key-value pair as an element of the collection (Example 3.12).

In R, the closest you can get to a Python dictionary is to use lists with named elements. This allows you to assign and retrieve values by key. However, the key is restricted to names, while in Python most objects can be used as keys. You create a named list with d = list(name=value) and access individual elements with either d$name or d[["name"]].

Python code

sentiments = {"positive":1, "neutral" : 0, "negative" : -1}
print(type(sentiments))
print("Sentiment for positive:", sentiments["positive"])
grades = {}
grades["A"] = 4
grades["B"] = 3
grades["C"] = 2
grades["D"] = 1
print(f"Grade for A: {grades['A']}")
print(grades)

R code

sentiments = list(positive=1, neutral=0, negative=-1)
print(class(sentiments))
print(glue("Sentiment for positive: ", sentiments$positive))
grades = list()
grades$A = 4
grades$B = 3
grades$C = 2
grades$D = 1
# Note: grades[["A"]] is equivalent to grades$A
print(glue("Grade for A: {grades[['A']]}"))
print(glue("Grade for A: {grades$A}"))
print(grades)

Python output

<class 'dict'>
Sentiment for positive: 1
Grade for A: 4
{'A': 4, 'B': 3, 'C': 2, 'D': 1}

R output

[1] "list"
Sentiment for positive: 1
Grade for A: 4
Grade for A: 4
$A
[1] 4

$B
[1] 3

$C
[1] 2

$D
[1] 1

A good analogy for a dictionary is a telephone book (imagine a paper one, but it actually often holds true for digital phone books as well): the names are the keys, and the associated phone numbers the values. If you know someone's name (the key), it is very easy to look up the corresponding values: even in a phone book of thousands of pages, it takes you maybe 10 or 20 seconds to look up the name (key). But if you know someone's phone number (the value) instead and want to look up the name, that's very inefficient: you need to read the whole phone book until you find the number.
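The phone book analogy can be sketched in a few lines. All names and numbers below are made up; the for loop and if statement are only explained later in this chapter, and the inverted dictionary by_number is our own suggestion rather than something the text prescribes.

```python
# Hypothetical phone book: names are the keys, numbers are the values
phone_book = {"Alice": "555-0101", "Bob": "555-0102", "Carol": "555-0103"}

# Looking up by key is direct
print(phone_book["Bob"])

# Looking up by value means scanning the entries one by one
wanted = "555-0103"
for name, number in phone_book.items():
    if number == wanted:
        print(name)
        break

# If such reverse lookups are frequent, build a second dictionary once
by_number = {number: name for name, number in phone_book.items()}
print(by_number[wanted])
```

The inverted dictionary only works if the values are unique; if two people share a number, the later entry overwrites the earlier one.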

Just as the elements of a list can be of any type, and you can have lists of lists, you can also nest dictionaries to get dicts of dicts. Think of our phone book example: rather than storing just a phone number as value, we could store another dict with the keys “office phone”, “mobile phone”, etc. This is very often done, and you will come across many examples dealing with such data structures. You have one restriction, though: the keys in a dictionary (as opposed to the values) are not allowed to be mutable. After all, imagine that you could use a list as a key in a dictionary: if some other pointer to that very same list then changed it, it would no longer be clear under which key the value is stored.
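Both points, nesting and the restriction on keys, can be seen in a short sketch. The names, numbers, and coordinates are hypothetical, and the try/except construct used to catch the error is not covered in this chapter.

```python
# Hypothetical nested phone book: each value is itself a dictionary
phone_book = {
    "Alice": {"office phone": "555-0101", "mobile phone": "555-0199"},
    "Bob": {"office phone": "555-0102"},
}
# Index twice: first the outer key, then the inner key
print(phone_book["Alice"]["mobile phone"])

# Immutable objects such as strings, numbers, and tuples can be keys
coordinates = {(52.37, 4.90): "Amsterdam"}
print(coordinates[(52.37, 4.90)])

# A (mutable) list is rejected as a key
rejected = False
try:
    bad = {[52.37, 4.90]: "Amsterdam"}
except TypeError as err:
    rejected = True
    print(f"TypeError: {err}")
```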

3.1.5.From One to More Dimensions: Matrices and $n$-Dimensional Arrays

Matrices are two-dimensional rectangular datasets that include values in rows and columns. This is the kind of data you will have to deal with in many analyses shown in this book, such as those related to machine learning. Often, we can generalize to higher dimensions.

Python code

matrix = [[1, 2, 3], [4, 5, 6], [7,8,9]]
print(matrix)
array2d = np.array(matrix)
print(array2d)

R code

my_matrix = matrix(c(0, 0, 1, 1, 0, 1),
    nrow = 2, ncol = 3, byrow = TRUE)
print(dim(my_matrix))
print(my_matrix)
my_matrix2 = matrix(c(0, 0, 1, 1, 0, 1),
    nrow = 2, ncol = 3, byrow = FALSE)
print(my_matrix2)

Python output

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
[[1 2 3]
 [4 5 6]
 [7 8 9]]

R output

[1] 2 3
     [,1] [,2] [,3]
[1,]    0    0    1
[2,]    1    0    1
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    0    1    1

In Python, the easiest representation is to simply construct a list of lists. This is, in fact, often done, but has the disadvantage that there are no easy ways to get, for instance, the dimensions (the shape) of the table, or to print it in a neat(er) format. To get all that, one can transform the list of lists into an array, a data structure provided by the package numpy (see Chapter 5 for more details).

To create a matrix in R, you use the function matrix and pass it a vector of values together with the number of rows and columns the matrix should have. We also have to tell R whether the values should be filled in row by row or column by column. In Example 3.13, we create two matrices in which we vary the byrow argument to be TRUE and FALSE, respectively, to illustrate how it changes the values of the matrix, even when the shape ($2 \times 3$) remains identical. As you may imagine, we can operate with matrices, such as adding up two of them.
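To make the limitations of the list-of-lists representation concrete, here is a sketch that uses only built-in lists, with two hypothetical matrices m1 and m2. Getting the shape and adding the matrices takes manual work; a numpy array offers the same through its shape attribute and the + operator.

```python
m1 = [[1, 2, 3], [4, 5, 6]]
m2 = [[0, 0, 1], [1, 0, 1]]

# Shape: the number of rows, and the number of columns in the first row
# (this assumes that all rows have the same length)
shape = (len(m1), len(m1[0]))
print(shape)

# Element-wise addition: pair up the rows, then the values within each row
total = [[v1 + v2 for v1, v2 in zip(row1, row2)]
         for row1, row2 in zip(m1, m2)]
print(total)

# Beware: + on plain lists concatenates them instead of adding the values
print(m1 + m2)
```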

3.1.6.Making Life Easier: Data Frames

So far, we have discussed the general built-in collections that you find in most programming languages, such as the list and array. However, in data science and statistics you are very likely to encounter a specific collection type that we haven't discussed yet: the data frame. Data frames are discussed in detail in Chapter 5, but for completeness we will also introduce them briefly here.

Data frames are user-friendly data structures that look very much like what you find in SPSS, Stata, or Excel. They will help you in a wide range of statistical analyses. A data frame is a tabular data object that includes rows (usually the instances or cases) and columns (the variables). In a three-column data frame, the first variable can be numeric, the second character, and the third logical, but the important thing is that each variable is a vector and that all these vectors must be of the same length. We create data frames from scratch using the data.frame() function in R or pd.DataFrame() in Python. Let’s generate a simple data frame of three instances (each case is an author of this book) and three variables of the types numeric (age), character (country where they obtained their master's degree), and logical (living abroad, whether they currently live outside the country in which they were born) (Example 3.14). Notice that you have the label of the variables at the top of each column and that it creates an automatic numbering for indexing the rows.

Python code

authors = pd.DataFrame({"age": [38, 36, 39],
  "countries": ["Netherlands","Germany","Spain"],
  "living_abroad": [False, True, True]})
print(authors)

R code

authors = data.frame(age = c(38, 36, 39),
  countries = c("Netherlands","Germany","Spain"),
  living_abroad= c(FALSE, TRUE, TRUE))
print(authors)

Python output. Note that R output may look slightly different

   age    countries  living_abroad
0   38  Netherlands          False
1   36      Germany           True
2   39        Spain           True

3.2.Simple Control Structures: Loops and Conditions

Control structures in Python and R. This section and the next explain the working of control structures such as loops, conditions, and functions. These exist (and are very useful) in both Python and R. In R, however, you do not need them as much because most functions can work on whole columns in one go, while in Python you often run things on each row of a column and sometimes do not use data frames at all. Thus, if you are primarily interested in using R you could consider skipping the remainder of this chapter for now and returning later when you are ready to learn more. If you are learning Python, we strongly recommend continuing with this chapter, as control structures are used in many of the examples in the book.

Having a clear understanding of objects and data types is a first step towards comprehending how object-oriented languages such as R and Python work, but now we need to get some literacy in writing code and interacting with the computer and the objects we created. Learning a programming language is just like learning any new language. Imagine you want to speak Italian or you want to learn how to play the piano. The first thing will be to learn some words or musical notes, and to get familiarized with some examples or basic structures – just as we did in Chapter 2. In the case of Italian or the piano, you would then have to learn some grammar: how to form sentences, how to play some chords; or, more generally, how to reproduce patterns. And this is exactly how we now move on to acquiring computational literacy: by learning some rules to make the computer do exactly what you want.

Remember that you can interact with R and Python directly on their consoles just by typing any given command. However, when you begin to use several of these commands and combine them you will need to put all these instructions into a script that you can then run partially or entirely. Recall Section 1.4, where we showed how IDEs such as RStudio (and PyCharm) offer both a console for directly typing single commands and a larger window for writing longer scripts.

Both R and Python are interpreted languages (as opposed to compiled languages), which means that interacting with them is very straightforward: You provide your computer with some statements (directly or from a script), and your computer reacts. We call a sequence of these statements a computer program. When we created objects by writing, for instance, a = 100, we already dealt with a very basic statement, the assignment statement. But of course the statements can be more complex.

In particular, we may want to say more about how and when statements need to be executed. Maybe we want to repeat the calculation of a value for each item on a list, or maybe we want to do this only if some condition is fulfilled.

Both R and Python have such loops and conditional statements, which make coding much easier and allow more sophisticated results because you can control the way your statements are executed. By controlling the flow of instructions you can deal with many challenges in computer programming, such as iterating over an arbitrary number of cases or executing part of your code depending on new inputs.

In your script, you usually indicate such loops and conditions visually by using indentation. Leading spaces – two in R and four in Python – mark the blocks and sub-blocks in your code structure. As you will see in the next section, in R, using indentation is optional, and curly brackets indicate the beginning ({) and end (}) of a code block; whereas in Python, indentation is mandatory and tells your interpreter where the block starts and ends.
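To make the role of indentation in Python concrete, here is a minimal sketch. The list `numbers` and the variable `total` are hypothetical names chosen only for this illustration; the loop syntax itself is explained in the next section.

```python
# Hypothetical values, used only to show how indentation defines a block.
numbers = [1, 2, 3]
total = 0

for n in numbers:
    # Indented by four spaces: this line belongs to the loop
    # and is executed once for every element of the list.
    total = total + n

# Not indented: this line is outside the loop
# and is executed only once, after the loop has finished.
print(total)
```

If the `print` call were indented as well, it would become part of the loop and run after every addition instead of once at the end.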

3.2.1.Loops

Loops can be used to repeat a block of statements. The block can be executed a fixed number of times, indefinitely, or until a certain condition is reached. This means that you can operate over a set of objects as many times as you want just by giving one instruction. The most common types of loops are for, while, and repeat (do-while), but we will be mostly concerned with so-called for-loops. Imagine you have a list of headlines as an object and you want a simple script to print the length of each message. Of course you can go headline by headline using indexing, but you will get bored or will not have enough time if you have thousands of cases. Thus, the idea is to loop over the list so you can get all the results, from the first until the last element, with just one instruction. The syntax of the for-loop is:

Python code

for val in sequence:
    statement1
    statement2
    statement3

R code

for (val in sequence) {
  statement1
  statement2
  statement3
}

As Example 3.15 illustrates, every time you find yourself repeating something, for instance printing each element from a list, you can get the same results more easily by iterating or looping over the elements of the list. Notice that you get the same results, but with the loop you can automate the operation by writing only a few lines of code. As we will stress in this book, a good practice in coding is to be efficient and economical in the amount of code we write, which is another justification for using loops.

Python code

headlines = ["US condemns terrorist attacks",
             "New elections forces UK to go back to the UE",
             "Venezuelan president is dismissed"]

# Manually counting each element
print("manual results:")
print(len(headlines[0]))
print(len(headlines[1]))
print(len(headlines[2]))

# Using a for-loop
print("for-loop results:")
for x in headlines:
    print(len(x))

R code

headlines = list("US condemns terrorist attacks",
                 "New elections forces UK to go back to the UE",
                 "Venezuelan president is dismissed")

# Manually counting each element
print("manual results: ")
print(nchar(headlines[1]))
print(nchar(headlines[2]))
print(nchar(headlines[3]))

# Using a for-loop
print("for-loop results:")
for (x in headlines){
  print(nchar(x))
}

Python output

manual results:
29
44
33
for-loop results:
29
44
33

R output

[1] "manual results: "
[1] 29
[1] 44
[1] 33
[1] "for-loop results:"
[1] 29
[1] 44
[1] 33

Don't repeat yourself! You may be used to copy-pasting syntax and slightly changing it when working with some statistics program: you run an analysis and then you want to repeat the same analysis with different datasets or different specifications. But this is error-prone and hard to maintain, as it involves a lot of extra work if you want to change something. In many cases where you find yourself pasting multiple versions of your code, you would probably be better off using a for-loop instead.
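As a sketch of this advice, the snippet below runs the same small "analysis" (computing a mean) on several datasets with one loop instead of three pasted copies. The dataset names and values are invented for illustration and do not come from the book.

```python
from statistics import mean

# Hypothetical datasets (invented values, for illustration only).
datasets = {
    "survey_2019": [3, 4, 5],
    "survey_2020": [2, 4, 6, 8],
    "survey_2021": [1, 1, 4],
}

# One loop replaces three copy-pasted versions of the same analysis.
averages = {}
for name, values in datasets.items():
    averages[name] = mean(values)
    print(name, averages[name])
```

If the analysis needs to change, for instance to a median instead of a mean, only one line has to be edited.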

Another way to iterate in Python is using list comprehensions (not available natively in R), which are a stylish way to create lists of elements automatically, even with conditional clauses. This is the syntax:

newlist = [expression for item in list if conditional]

In Example 3.16 we provide a simple example (without any conditional clause) that creates a list with the number of characters of each headline. As this example illustrates, list comprehensions allow you to essentially write a whole for-loop in one line. Therefore, list comprehensions are very popular in Python.

Python code

len_headlines = [len(x) for x in headlines]
print(len_headlines)

# Note: the "list comprehension" above is
# equivalent to the more verbose code below:
len_headlines = []
for x in headlines:
    len_headlines.append(len(x))
print(len_headlines)

Python output. Note that R output may look slightly different

[29, 44, 33]
[29, 44, 33]
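The example above leaves out the optional conditional clause. A minimal sketch of a list comprehension that does use it could look as follows; the threshold of 40 characters and the variable names `short_headlines` and `short_lengths` are assumptions made for this illustration.

```python
headlines = ["US condemns terrorist attacks",
             "New elections forces UK to go back to the UE",
             "Venezuelan president is dismissed"]

# Keep only the headlines that are shorter than 40 characters
short_headlines = [x for x in headlines if len(x) < 40]
print(short_headlines)

# The expression and the condition can be combined:
# here we store the lengths of the short headlines only
short_lengths = [len(x) for x in headlines if len(x) < 40]
print(short_lengths)
```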

3.2.2.Conditional Statements

Conditional statements allow you to control the flow and order of the commands you give the computer. This means you can tell the computer to do this or that, depending on a given circumstance. These statements use logical operators to test if your condition is met (True) or not (False) and execute an instruction accordingly. Both in R and Python, we use the clauses if, else if (elif in Python), and else to write the syntax of the conditional statements. Let's begin by showing you the basic structure of the conditional statement:

Python code

if condition:
    statement1
elif other_condition:
    statement2
else:
    statement3

R code

if (condition) {
  statement1
} else if (other_condition) {
  statement2
} else {
  statement3
}

Suppose you want to print the headlines of Example 3.15 only if the text is less than 40 characters long. To do this, we can include the conditional statement in the loop, executing the body only if the condition is met (Example 3.17).

Python code

for x in headlines:
    if len(x) < 40:
        print(x)

R code

for (x in headlines){
  if (nchar(x) < 40) {
    print(x)
  }
}

Python output. Note that R output may look slightly different

US condemns terrorist attacks
Venezuelan president is dismissed

We could also make it a bit more complicated: first check whether the length is smaller than 30, then check whether it is exactly 44 (elif / else if), and finally specify what to do if none of the conditions was met (else).

In Example 3.18, we will print the headline if it is shorter than 30 characters, print the string “What a coincidence!” if it is exactly 44 characters, and print “Too low” in all other cases. Notice that we have included the clause elif in the structure (in R it is written else if). elif is a combination of else and if: if the previous condition is not satisfied, this condition is checked and, if it holds, the corresponding code block is executed; otherwise the else block is executed. This avoids having to nest the second if within the else, but otherwise the reasoning behind the control flow statements remains the same.

Python code

for x in headlines:
    if len(x) < 30:
        print(x)
    elif len(x) == 44:
        print("What a coincidence!")
    else:
        print("Too low")

R code

for (x in headlines) {
  if (nchar(x) < 30) {
    print(x)
  } else if (nchar(x) == 44) {
    print("What a coincidence!")
  } else {
    print("Too low")
  }
}

Python output. Note that R output may look slightly different

US condemns terrorist attacks
What a coincidence!
Too low
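To illustrate the remark that elif avoids nesting a second if within the else, the following sketch writes the same logic in both ways and checks that the results agree. Collecting the results in the lists `with_elif` and `nested` (instead of printing them directly) is a choice made only for this illustration.

```python
headlines = ["US condemns terrorist attacks",
             "New elections forces UK to go back to the UE",
             "Venezuelan president is dismissed"]

# Version 1: using elif
with_elif = []
for x in headlines:
    if len(x) < 30:
        with_elif.append(x)
    elif len(x) == 44:
        with_elif.append("What a coincidence!")
    else:
        with_elif.append("Too low")

# Version 2: the second if is nested within the else block
nested = []
for x in headlines:
    if len(x) < 30:
        nested.append(x)
    else:
        if len(x) == 44:
            nested.append("What a coincidence!")
        else:
            nested.append("Too low")

# Both versions produce the same list of results
print(with_elif == nested)
```

The nested version needs an extra level of indentation for every additional condition, which is why elif is usually easier to read.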

3.3.Functions and Methods

Functions and methods are fundamental concepts in writing code in object-orientated programming. Both are objects that we use to store a set of statements and operations that we can use later without having to write the whole syntax again. This makes our code simpler and more powerful.

We have already used some built-in functions, such as length and class (R) and len and type (Python) to get the length of an object and the class to which it belongs. But, as you will learn in this chapter, you can also write your own functions. In essence, a function takes some input (the arguments supplied between brackets) and returns some output. Methods and functions are very similar concepts. The difference between them is that the functions are defined independently from the object, while methods are created based on a class, meaning that they are associated with an object. For example, in Python, each string has an associated method lower, so that writing 'HELLO'.lower() will return 'hello'. In R, in contrast, one uses a function, tolower('HELLO'). For now, it is not really important to know why some things are implemented as a method and some are implemented as a function; it is partly an arbitrary choice that the developers made, and to fully understand it, you need to dive into the concept of classes, which is beyond the scope of this book.
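The difference between calling a function and calling a method can be seen in a few lines of Python. The variable name `text` is an assumption made for this sketch.

```python
text = "HELLO"

# A function is called with the object as its argument
n_characters = len(text)
print(n_characters)

# A method is called on the object itself, using a dot
lowercase = text.lower()
print(lowercase)

# The method does not change the original string; it returns a new one
print(text)
```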

Tab completion. Because methods are associated with an object, you have a very useful trick at your disposal to find out which methods (and other properties of an object) there are: TAB completion. In Jupyter, just type the name of an object followed by a dot (e.g., a.<TAB> in case you have an object called a) and hit the TAB key. This will open a drop-down menu to choose from.
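If TAB completion is not available in your environment, the built-in function `dir` offers a similar overview from within the code. The sketch below is only one possible way to use it; the object `a` and the filtering of names that start with an underscore are choices made for this illustration.

```python
a = "hello"

# dir() lists the names of all attributes and methods of an object.
# Names starting with an underscore are internal, so we skip them here.
methods = [name for name in dir(a) if not name.startswith("_")]
print(methods)

# For example, the string methods lower and upper are part of this list
print("lower" in methods)
print("upper" in methods)
```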

We will illustrate how to create simple functions in R and Python, so you will have a better understanding of how they work. Imagine you want to create two functions: one that computes 60% of any given number and another that computes this percentage only if the given argument is above the threshold of 5. The general structure of a function in R and Python is:

Python code

def f(par1, par2=0):
    statements
    return return_value

result = f(arg1, arg2)
result = f(par1=arg1, par2=arg2)
result = f(arg1, par2=arg2)
result = f(arg1)

R code

f = function(par1, par2=0) {
  statements
  return_value
}

result = f(arg1, arg2)
result = f(par1=arg1, par2=arg2)
result = f(arg1, par2=arg2)
result = f(arg1)

In both cases, this defines a function called f, with two parameters, par1 and par2. When you call the function, you specify the values for these parameters (the arguments) between brackets after the function name. You can then store the result of the function as an object as normal.

As you can see in the syntax above, you have some choices when specifying the arguments. First, you can specify them by name or by position. If you include the name (f(par1=arg1)) you explicitly bind that argument to that parameter. If you don't include the name (f(arg1, arg2)) the first argument matches the first parameter and so on. Note that you can mix and match these choices, specifying some parameters by name and others by position.

Second, some functions have optional parameters, for which they provide a default value. In this case, par2 is optional, with default value 0. This means that if you don't specify the parameter it will use the default value instead. Usually, the mandatory parameters are the main objects used by the function to do its work, while the optional parameters are additional options or settings. It is recommended to generally specify these options by name when you call a function, as that increases the readability of the code. Whether to specify the mandatory arguments by name depends on the function: if it's obvious what the argument does, you can specify it by position, but if in doubt it's often better to specify them by name.

Finally, note that in Python you explicitly indicate the result value of the function with return value. In R, the value of the last expression is automatically returned, although you can also explicitly call return(value).
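The different ways of calling a function described above can be tried out with a small sketch. The function `scale` and its parameters `value` and `factor` are hypothetical and exist only for this illustration.

```python
def scale(value, factor=1):
    # Hypothetical function: multiplies value by factor.
    # factor is optional and defaults to 1.
    return value * factor

by_position = scale(10, 3)          # both arguments by position
by_name = scale(value=10, factor=3) # both arguments by name
mixed = scale(10, factor=3)         # first by position, second by name
default = scale(10)                 # factor is omitted, so the default is used

print(by_position, by_name, mixed, default)
```

The first three calls are equivalent; only the last one differs, because it falls back on the default value of the optional parameter.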

Example 3.19 shows how to write our function and how to use it.

Python code

# The first function just computes 60% of the value
def perc_60(x):
    return x*0.6

print(perc_60(10))
print(perc_60(4))

# The second function only computes 60% if the
# value is bigger than 5
def perc_60_cond(x):
    if x>5:
        return x*0.6
    else:
        return x

print(perc_60_cond(10))
print(perc_60_cond(4))

R code

# The first function just computes 60% of the value
perc_60 = function(x) x*0.6

print(perc_60(10))
print(perc_60(4))

# The second function only computes 60% if the
# value is bigger than 5
perc_60_cond = function(x) {
  if (x>5) {
    return(x*0.6)
  } else {
    return(x)
  }
}

print(perc_60_cond(10))
print(perc_60_cond(4))

Python output. Note that R output may look slightly different

6.0
2.4
6.0
4

The power of functions, though, lies in scenarios where they are used repeatedly. Imagine that you have a list of 5 (or 5 million!) scores and you wish to apply the function perc_60_cond to all the scores at once using a loop. This costs you only two extra lines of code (Example 3.20).

Python code

# Apply the function in a for-loop
scores = [3,4,5,7]
for x in scores:
    print(perc_60_cond(x))

R code

# Apply the function in a for-loop
scores = list(3,4,5,6,7)
for (x in scores) {
  print(perc_60_cond(x))
}

Python output. Note that R output may look slightly different

3
4
5
4.2
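Instead of printing each result in a for-loop, the function can also be combined with a list comprehension to store all results in a new list. This is a sketch under the assumption that you want to keep the results for later use; the name `results` is chosen for illustration.

```python
def perc_60_cond(x):
    # Compute 60% only if the value is bigger than 5
    if x > 5:
        return x * 0.6
    else:
        return x

scores = [3, 4, 5, 7]

# Apply the function to every score and collect the results in a list
results = [perc_60_cond(x) for x in scores]
print(results)
```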

A specific type of Python function that you may come across at some point (for instance, in Section 12.2.2) is the generator. Think of a function that returns a list of multiple values. Often, you do not need all values at once: you may only need the next value at a time. This is especially interesting when calculating the whole list would take a lot of time or a lot of memory. Rather than waiting for all values to be calculated, you can immediately begin processing the first value before the next arrives; or you can work with data so large that it doesn't all fit into your memory at the same time. You recognize a generator by the yield keyword instead of a return keyword (Example 3.21).

Python code

mylist = [35,2,464,4]

def square1(somelist):
    listofsquares = []
    for i in somelist:
        listofsquares.append(i**2)
    return listofsquares

def square2(somelist):
    for i in somelist:
        yield i**2

print("As a list:")
mysquares = square1(mylist)
for mysquare in mysquares:
    print(mysquare)
print(type(mysquares))
print(f"The list has {len(mysquares)} entries")

print("\nAs a generator:")
mysquares = square2(mylist)
for mysquare in mysquares:
    print(mysquare)
print(type(mysquares))
# This throws an error (generators have no length)
print(f"mysquares has {len(mysquares)} entries")

Python output. Note that R output may look slightly different

As a list:
1225
4
215296
16
<class 'list'>
The list has 4 entries

As a generator:
1225
4
215296
16
<class 'generator'>
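The idea that a generator produces its values one at a time can be made visible with the built-in function `next`. The following sketch reuses the generator function from the example above; catching the error with try and except, and the variable names `first`, `second`, `remaining` and `length_error`, are additions made for this illustration.

```python
def square2(somelist):
    for i in somelist:
        yield i**2

mysquares = square2([35, 2, 464, 4])

# Each call to next() computes and returns only the next value
first = next(mysquares)
second = next(mysquares)
print(first, second)

# The values that have not been requested yet can still be collected
remaining = list(mysquares)
print(remaining)

# A generator has no length, so len() raises a TypeError
try:
    len(mysquares)
    length_error = None
except TypeError as err:
    length_error = err
print(length_error)
```

Note that a generator can only be consumed once: after all values have been requested, it is exhausted and yields nothing more.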

So far you have taken your first steps as a programmer, but there are many more advanced things to learn that are beyond the scope of this book. You can find a lot of literature, online documentation and even wonderful YouTube tutorials to keep learning. We can recommend the books by Crawley (2012) and VanderPlas (2016) for more insights into R and Python, respectively. In the next chapter, we will go deeper into the world of code in order to learn how and why you should re-use existing code, what to do if you get stuck during your programming journey, and what the best practices are when coding.


Abstract This chapter summarizes the main learning goals of the book, and outlines possible next steps. Special attention is paid to an ethical application of computational methods, as well as to the importance of open and transparent science.

Keywords: summary, open science, ethics

Chapter objectives:

Reflect on the learning goals of the book
Point out avenues for future study
Highlight ethical considerations for applying the techniques covered in the book
Relate the techniques covered in the book to Open Science practices

This concluding chapter provides a broad overview of what was covered so far, and what interesting avenues there are to explore next. It gives pointers to resources to learn more about topics such as programming, statistical modeling or deep learning. It also discusses considerations regarding ethics and open science.

In this book, we introduced you to the computational analysis of communication. In Chapter 1, we tried to convince you that the computational analysis of communication is a worthwhile endeavor – and we also highlighted that there is much more to the subject than this book can cover. So here we are now. Maybe you skipped some chapters, maybe you did some additional reading or followed some online tutorials, and maybe you completed your first small project that involved some of the techniques we covered. Time to recap.

You now have some knowledge of programming. We hope that this has opened new doors for you, and allows you to use a wealth of libraries, tutorials, and tools that may make your life easier, your research more productive, and your analyses better.

You have learned how to handle new types of data. Not only traditional tabular datasets, but also textual data, semi-structured data, and to some extent network data and images.

You can apply machine-learning frameworks. You know about both unsupervised and supervised approaches, and can decide how they can be useful for finding answers to your research questions.

Finally, you have got at least a first impression of some cool techniques like neural networks and services such as databases, containers, and cloud computing. We hope that being aware of them will help you to make an informed decision whether they may be good tools to dive into for your upcoming projects.

16.2.Where To Go Next?

But what should you learn next?

Most importantly, we cannot stress enough that it should be the research question that is the guide, not the method. You shouldn't use the newest neural network module just because it's cool, when counting occurrences of a simple regular expression does the trick. But this also applies the other way around: if a new method performs much better than an old one, you should learn it! For too long, for instance, people have relied on simple bag-of-words sentiment analyses with off-the-shelf dictionaries, simply because they were easy to use – despite better alternatives being available.

Having said that, we will nevertheless try to give some general recommendations for what to learn next.

Become better at programming. In this book, we tried to find a compromise between teaching the programming concepts necessary to apply our methods on the one hand, and not getting overly technical on the other hand. After all, for many social scientists, programming is a means to an end, not a goal in itself. But as you progress, a deeper understanding of some programming concepts will make it easier for you to tailor everything according to your needs, and will – again – open new doors. There are countless books and online tutorials on “Programming in [Language of your choice]”. In fact, in this “bilingual” book we have shown you how to program with R and Python (the languages most used by data scientists), but there are other programming languages that might also deserve your attention (e.g., Java, Scala, or Julia) if you become a computational scientist.

Learn how to write libraries. A very specific yet widely applicable skill we'd encourage you to learn is writing your own packages (“modules” or “libraries”). One of the nice things about computational analyses is that they are very much compatible with an Open Science approach. Sharing what you have done is much easier if everything you did is already documented in some code that you can share. But you can go one step further: of course it is nice if people can exactly reproduce your analysis, but wouldn't it be even nicer if they could also use your code to run analyses using their own data? If you thought of a great way to compute some statistic, why not make it easy for others to do the same? Consider writing (and documenting!) your code in a general way and then publishing it on CRAN or PyPI so others can easily install and use it.

Get inspiration for new types of studies. Try to think a bit out of the box and beyond classical surveys, experiments, and content analyses to design new studies. Books like Bit by Bit (Salganik, 2019) may help you with this. You can also take a look at other scientific disciplines such as computational biology, which has reinvented its methods, questions and hypotheses. Keep in mind that computational methods have an impact on the theoretical and empirical discussions of communication processes, which in turn will call for novel types of studies. Emerging scientific fields such as Computational Communication Science, Computational Social Sciences and Digital Humanities show how theory and methods can develop hand in hand.

Get a deeper understanding of deep learning. For many tasks in the computational analysis of communication, classical machine learning approaches (like regression or support vector machines) work just fine. In fact, there is no need to always jump on the bandwagon of the newest technique. If a simple logistic regression achieves an F1-score of 88.1, and the fanciest neural network achieves an 88.5 – would it be worth the extra effort and the loss of explainability? It depends on your use case, but probably not. Nevertheless, by now, we can be fairly certain that neural networks and deep learning are here to stay. We could only give a limited introduction in this book, but state-of-the-art analysis of text and especially visual material cannot do without it any more. Even if you do not train such models yourself all the time and instead use, for instance, pre-trained word embeddings or packages like spacy that have been trained using neural networks, it seems worthwhile to understand these techniques better. Here, too, a lot of online tutorials exist for frameworks such as keras or tensorflow, as well as thorough books that provide a sound understanding of the underlying models (Goldberg, 2017).

Learn more about statistical models. Not everything in the computational analysis of communication is machine learning. We used the analogy of the mouse trap (where we only care about the performance, not the underlying mechanism) versus better prior understanding, and argued that often, we may use machine learning as a “mouse trap” to enrich our data – even if we are ultimately interested in explaining some other process. For instance, we may want to use machine learning as one step in a workflow to predict the topic of social media messages, and then use a conventional statistical approach to understand which factors explain how often the message has been shared. Such data, though, often have different characteristics than data that you may encounter in surveys or experiments. In this case, for instance, the number of shares is a so-called count variable: it can take only non-negative integers, and thus has a lower bound (0) but no upper bound. That's very different from normally distributed data and requires regression models such as negative binomial regression. That's not difficult to do, but worth reading up on. Similarly, multilevel modelling will often be appropriate for the data you work with. Being familiar with this and other techniques (such as mediation and moderation analysis, or even structural equation modeling) will allow you to make better choices. On a different note, you may want to familiarize yourself with Bayesian statistics – a framework that is very different from the so-called frequentist approach that you probably know from your statistics courses.

And, last but not least: have fun! At least for us, that is one of the most important parts: don't forget to enjoy the skills you gained, and create projects that you enjoy!

16.3.Open, Transparent, and Ethical Computational Science

We started this book by reflecting on what we are actually doing when conducting computational analyses of communication. One of the things we highlighted in Chapter 1 was our use of open-source tools, in particular Python and R and the wealth of open-source libraries that extend them. Hopefully, you have also realized not only how much your work could therefore build on the work of others, but also how many of the resources you used were created as a community effort.

Now that you have acquired the knowledge it takes to conduct computational research on communication, it is time to reflect on how to give back to the community, and how to contribute to an open research environment. At the same time, it is not as simple as “just putting everything online” – after all, researchers often work with sensitive data. We therefore conclude this book with a short discussion on open, transparent, and ethical computational science.

Transparent and Open Science. In the wake of the so-called reproducibility crisis, the call for transparent and open science has become louder and louder in recent years. The public, funders, and journals increasingly ask for access to the data and analysis scripts that underlie published research. Of course, publishing your data and code is not a panacea for all problems, but it is a step towards better science from at least two perspectives (Van Atteveldt et al., 2019): first, it allows others to reproduce your work, enhancing its credibility (and the credibility of the field as a whole). Second, it allows others to build on your work without reinventing the wheel.

So, how can you contribute to this? Most importantly, as we advised in Section 4.3: use a version control system and share your code on a site like github.com. We also discussed code-sharing possibilities in Section 15.3. Finally, you can find a template for organizing your code and data so that your research is easy to reproduce at github.com/ccs-amsterdam/compendium.

The privacy–transparency trade-off. While the sharing of code is not particularly controversial, the sharing of data sometimes is. In particular, you may deal with data that contain personally identifiable information. On the one hand, you should share your data to make sure that your work can be reproduced – on the other hand, it would be ethically (and depending on your jurisdiction, potentially also legally) wrong to share personal data about individuals. As boyd and Crawford (2012) write: “Just because it is accessible does not make it ethical.” Hence, the situation is not always black or white, and some techniques exist to find a balance between the two: you can remove (or hash) information such as usernames, you can aggregate your data, or you can add artificial noise. Ideally, you should integrate legal, ethical, and technical considerations to make an informed decision on how to find a balance such that transparency is maximized while privacy risks are minimized. More and more literature explores different possibilities (Breuer et al., 2020).
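As a minimal sketch of what hashing usernames could look like in Python, the snippet below replaces each username by a shortened SHA-256 hash using the standard library module hashlib. The example records, the salt value, the function name `pseudonymize`, and the choice to keep only 12 characters are all assumptions made for this illustration. Note also that hashing on its own does not guarantee anonymity, so this is a starting point rather than a complete solution.

```python
import hashlib

# Hypothetical secret value that is mixed into every hash so that
# usernames cannot simply be looked up by hashing common names.
SALT = "replace-with-a-secret-value"

def pseudonymize(username, salt=SALT):
    # Hash the salted username and keep the first 12 characters
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return digest[:12]

# Hypothetical records (invented for illustration only)
posts = [
    {"user": "alice_92", "shares": 3},
    {"user": "bob_k", "shares": 0},
    {"user": "alice_92", "shares": 12},
]

# Replace the username in every record by its pseudonym
anonymized = [{"user": pseudonymize(p["user"]), "shares": p["shares"]}
              for p in posts]

for record in anonymized:
    print(record)
```

Because the same username always results in the same hash, you can still see that two messages come from the same account without storing the account name itself.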

Other Ethical Challenges in Computational Analyses. Lastly, there are also other ethical challenges that go beyond the use of privacy-sensitive data. Many tools we use give us great power, and with that comes great responsibility. For instance, as we highlighted in Section 12.4, every time we scrape a website, we cause some costs somewhere. They may be negligible for a single HTTP request, but they may add up. Similarly, calculations on some cloud service cause environmental costs. Before starting a large-scale project, we should therefore make a trade-off between the costs or damage we cause, and the (scientific) gain that we achieve.

In the end, though, we firmly believe that as computational scientists, we are well-equipped to contribute to the move towards more ethical, open, and transparent science. Let's do it!