hupili / python-for-data-and-media-communication-gitbook
An open source book on Python tailored for communication students with zero background
New challenge is added: https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-03.md#simple-word-frequency
The "in" operator is missing. Please add it to the corresponding section.
Welcome to paste your answers below!
Example 21 in the Week3 tutorial is helpful for learning how the for loop works. But I also found that the same result can be achieved with a while loop. Below is the code I wrote, shared here for reference and suggestions. You can also check the results in nbviewer here.
Fixed_Cost = 30000
Content_Cost = 70000
member_ff = 15
convert_rate = 0.1
ad_revenue_each_person = 1
subscribers=0
net_income=0.1*subscribers*member_ff+ad_revenue_each_person*subscribers-Fixed_Cost-Content_Cost
if net_income==0:
    print(subscribers)
else:
    while subscribers>=0:
        subscribers=subscribers+1
        if subscribers<50000:
            net_income=0.1*subscribers*member_ff+ad_revenue_each_person*subscribers-Fixed_Cost-Content_Cost
        else:
            net_income=0.1*subscribers*member_ff+ad_revenue_each_person*subscribers-Fixed_Cost-Content_Cost-0.1*(subscribers-50000)
        if net_income==0:
            break
    print(subscribers)
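One possible variation on the while-loop idea above: comparing a floating-point result with `== 0` can miss the break-even point if rounding makes the value slightly off, so this sketch stops as soon as net income is no longer negative. The function names `net_income` and `break_even` are my own, not from the tutorial, and the numbers are copied from the post.

```python
FIXED_COST = 30000
CONTENT_COST = 70000
MEMBER_FEE = 15
CONVERT_RATE = 0.1
AD_REVENUE_PER_PERSON = 1


def net_income(subscribers):
    """Income minus cost for a given number of subscribers."""
    income = CONVERT_RATE * subscribers * MEMBER_FEE + AD_REVENUE_PER_PERSON * subscribers
    cost = FIXED_COST + CONTENT_COST
    if subscribers >= 50000:
        # extra variable cost above 50000 subscribers, as in the post
        cost += 0.1 * (subscribers - 50000)
    return income - cost


def break_even():
    """Smallest subscriber count whose net income is not negative."""
    subscribers = 0
    while net_income(subscribers) < 0:
        subscribers += 1
    return subscribers


print(break_even())
```

The loop condition replaces the explicit `break`, so there is only one place that decides when to stop.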
Here are the instructions for installing Jupyter. Generally, we type the following commands to create a virtual environment and then install all dependencies and modules. But in the lab, there are some problems:
pyvenv venv
source venv/bin/activate
pip3 install jupyter
You will get an error asking you to upgrade your pip version while installing jupyter. Just copy the suggested command:
pip install --upgrade pip
Then install jupyter and open the jupyter notebook:
pip3 install jupyter
jupyter notebook
Then you will encounter another problem: whatever you execute, the cell stays in the running state and a sign says "Kernel starting, please wait...".
The reason can be found in the terminal. When you installed jupyter, there was a red line saying:
ipython 7.0.1 has requirement prompt-toolkit<2.1.0,>=2.0.0, but you'll have prompt-toolkit 1.0.15 which is incompatible.
In order to solve this, there are two solutions, which can be found in jupyter_kernel issue #158:
solution 1: pip install 'ipykernel<5.0.0'
solution 2: first downgrade ipython with pip install -U ipython==6.5.0, then prompt-toolkit with pip install -U prompt-toolkit==1.0.15
Finally, open jupyter notebook and type something to test. It should work now!
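To see which versions actually ended up in the virtual environment, a small check like the following may help. It is only a sketch: `installed_versions` is a helper name I made up, and it relies on the standard-library `importlib.metadata` module (Python 3.8 or later).

```python
from importlib import metadata


def installed_versions(package_names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The three packages mentioned in the error message and the solutions
for name, version in installed_versions(['ipython', 'ipykernel', 'prompt-toolkit']).items():
    print(name, version)
```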
Here is the file naming convention:
Please do a quick scan of the assets folder and change the file names and their corresponding references in the markdown files.
It would be better to explain to students that Jupyter notebook does not show results on the screen unless you use the "print()" command or type the variable on a new line. It may cause confusion if we don't mention it.
my_urls = data.find_all('a',attrs={'class':'post__title-link js-read-more'})
I am wondering why the value here is 'post__title-link js-read-more'. When I inspect the page, I only see this:
「消失的檔案」:香港開放數據大檢討
Why is it not this one: class="post__title-link"?
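A likely explanation, offered as a sketch rather than a definitive answer: an HTML class attribute can hold several class names separated by spaces, so 'post__title-link js-read-more' is the full attribute value and 'post__title-link' is one of the two classes in it. The HTML string below is a made-up sample, and the parser class is my own; only the standard library is used.

```python
from html.parser import HTMLParser


class AnchorClassCollector(HTMLParser):
    """Collect the list of class names of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.class_lists = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            class_value = dict(attrs).get('class') or ''
            # the attribute value is a space-separated list of class names
            self.class_lists.append(class_value.split())


# hypothetical sample, not copied from the real page
sample_html = '<a class="post__title-link js-read-more" href="/blog/sample-article/">Sample title</a>'
collector = AnchorClassCollector()
collector.feed(sample_html)
print(collector.class_lists)
```

As far as I know, BeautifulSoup can also match on a single class, for example data.find_all('a', class_='post__title-link'), so both forms should find the same links on that page.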
One thing I cannot clearly understand is the format part:
my_url['href'].split('/blog')[-1]
Can you explain a little how this returns the format we want?
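The expression can be read step by step with a plain string. The href value here is the example that appears elsewhere in this thread; the real values on the page may differ.

```python
href = '../blog/20170113-Sharing-With-Friends-Versus-Strangers/'

# split() cuts the string at every '/blog' and returns the pieces as a list
parts = href.split('/blog')
print(parts)

# [-1] picks the last piece, i.e. everything after '/blog'
last_part = parts[-1]
print(last_part)
```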
for i in range(1,8): #format all pages urls
    if i == 1:
        page_url = url
    else:
        page_url = '{url_initial}page/{number}/'.format(url_initial = url,number=i)
    #print(page_url)
This uses if logic: if the page number is 1, page_url is just url; otherwise it needs to be formatted further. My question is why we cannot apply the same formatting to all pages. Don't they share the same format?
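My guess at the reason, as a sketch: on many sites the first page lives at the base URL itself and only later pages have a page/N/ suffix, so page 1 needs its own branch. Whether page/1/ also works on this site is something to test in the browser. The base URL below is a placeholder and `build_page_urls` is a name I made up.

```python
def build_page_urls(base_url, last_page):
    """Return the list of page URLs from page 1 to last_page."""
    page_urls = []
    for i in range(1, last_page + 1):
        if i == 1:
            # first page: the base URL without any suffix
            page_urls.append(base_url)
        else:
            page_urls.append('{url_initial}page/{number}/'.format(url_initial=base_url, number=i))
    return page_urls


urls = build_page_urls('http://example.com/blog/', 3)
for page_url in urls:
    print(page_url)
```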
Thanks for your hard work! Yes, this week is more difficult than before, but I think it is clear to follow. : )
Feedback1:
I got an error when trying to install BeautifulSoup using the first command mentioned, as follows
The output is as follows:
Following are some permalinks for codes and data.
In Example 13:
test3 = "python loves,'you'"
test3.find('you')
8 #returns the first character where 'you' begins
Should that be 8? I think it should be 14 instead.
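A quick check of the index, assuming the string contains one quote character between the comma and "you" (which quote character it is does not change the position):

```python
test3 = "python loves,'you'"

# find() returns the index of the first character of the match, counting from 0
position = test3.find('you')
print(position)

# slicing from that index gives back the matched text
print(test3[position:position + 3])
```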
scraper-example folders. Create new files and don't overwrite the previous openrice example.
pandas basics. Note that there is a bit of duplication between ch7 and ch8 (1D part). This is normal because most of the time you observe errors during analysis. So analysis skills can also help data cleaning.
Example: I get IOError when running my script to load files.
Example:
open('path-to-a-file-not-exist')
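For reference, a minimal sketch of how this error can be caught instead of crashing the script. In Python 3, IOError is an alias of OSError, and a missing file raises FileNotFoundError, which is a subclass of it. The helper name `read_text_or_none` is my own.

```python
def read_text_or_none(path):
    """Return the file content, or None if the file cannot be opened."""
    try:
        with open(path) as f:
            return f.read()
    except OSError as error:
        # FileNotFoundError and other I/O errors end up here
        print('Could not open file:', error)
        return None


content = read_text_or_none('path-to-a-file-not-exist')
print(content)
```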
Example:
I wrote this program to allocate one case to every five students. But something seems to go wrong with the index on the lines "print(list2[s])" and "print(list1[c:c+5])". How can I give s and c the initial value 0 and change their values on each iteration?
student_list =[
18421111,
18421112,
18421113,
18421114,
18421115,
18421116,
18421117,
18421118,
18421119,
18421120,
18421121,
18421122,
18421123,
18421124,
18421125,
18421126,
18421127,
18421128,
18421129,
18421130,
18421131,
18421132,
18421133,
18421134,
18421135,
18421136,
18421137,
18421138,
18421139,
18421140,
18421141,
18421142,
18421143,
18421144,
18421145,
18421146,
18421147,
18421148,
18421149,
18421150,
18421151,
18421152,
18421153,
18421154,
18421155,
18421156,
18421157,
18421158,
18421159,
18421160,
]
case_list =[
'case1 - build a calculator to evaluate your business model',
'case2 - build a automatic earthquake robot to broadcast the new earthquake',
'case3 - evaluate social media performance of a luxury brand',
"case4 - study movie blockbuster 'Dying to Survive'",
'case5 - invest your money like the Internet giant, Tencent',
'case6 - where are the 200,000 inferior vaccines flowing?',
"case7 - study classics, Who control the discourse power in 'Dream of the Red Chamber'",
'case8 - research about Didi-driver crimes in China',
"case9 - 'Me too' analysis",
'case10 - what is hip-hop in china?'
]
import random
random.shuffle(student_list)
list1=student_list
print(list1)
random.shuffle(case_list)
list2=case_list
print(list2)
s=0
c=0
for s in student_list:
    print(list2[s])
    s=s+1
for c in case_list:
    print(list1[c:c+5])
    c=c+5
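One way to approach the question, as a sketch: in "for s in student_list", s takes each student ID in turn (for example 18421111), not a position, so list2[s] is far out of range, and assigning s=s+1 inside the loop has no effect on the next iteration. Looping over positions with enumerate avoids both problems. The function name `allocate`, the shortened case names and the seed parameter are my own choices.

```python
import random


def allocate(students, cases, group_size=5, seed=None):
    """Shuffle students and cases, then give each case the next group of students."""
    rng = random.Random(seed)
    shuffled_students = list(students)   # copies, so the inputs stay unchanged
    shuffled_cases = list(cases)
    rng.shuffle(shuffled_students)
    rng.shuffle(shuffled_cases)

    allocation = {}
    for i, case in enumerate(shuffled_cases):
        start = i * group_size           # 0, 5, 10, ...
        allocation[case] = shuffled_students[start:start + group_size]
    return allocation


students = list(range(18421111, 18421161))           # the 50 IDs from the post
cases = ['case{}'.format(n) for n in range(1, 11)]   # shortened case names
result = allocate(students, cases, seed=0)
for case, group in result.items():
    print(case, group)
```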
Python is a dynamic language, which makes error checking at compile time hard. Most errors are exposed at run time. However, some tools can still help us catch most of the errors and write best-practice code. "Linting" is a general concept found in all programming languages that refers to the process of identifying potential errors and suboptimal practices at writing time.
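A small illustration of why this matters, as a sketch with a made-up function: the typo below sits in a branch that is rarely executed, so Python only complains when that branch actually runs. Linters such as pylint or flake8 typically flag this kind of undefined name before the code is run.

```python
def total_cost(number_of_users):
    if number_of_users > 50000:
        # typo on purpose: "number_of_user" is missing the final "s"
        return 10000 + 0.1 * (number_of_user - 50000)
    return 10000


# The function definition itself is accepted, and small inputs work fine
print(total_cost(100))

# The error only shows up when the buggy branch is reached
try:
    total_cost(120000)
except NameError as error:
    print('Caught at run time:', error)
```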
Go to https://apps.twitter.com/app/new and click 'Apply for a developer account'.
Choose “Personal use” and enter your desired Application Name, Primary country (hk) and so on.
The most important part is: Describe in your own words what you are building
You have to answer each question in as much detail as possible. The following is an example:
1. I’m using Twitter’s APIs to practice my data collection and analysis skills in the Big Data Analysis course. I am currently a postgraduate student at Hong Kong Baptist University, majoring in Communication, and normally collect user comment data for mass communication research.
2. As for the methods and techniques I plan to use, here is our course GitHub open book. You can take it as reference. https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-04.md#use-api-via-function-calls-to-other-modules-packages
3. I only use it to practice data collection, and will not use it for tweeting, retweeting or liking content.
4. Tweets will be displayed in our final project presentation for academic use.
5. Finally, I will comply with the policies of Twitter.
I am looking forward to your favorable reply.
Then submit your application and verify your email.
After that, your application may be under review, or you may jump to a new page:
This chapter is really wonderful. Although I had already got some fundamental knowledge of Python from other channels, I still received new inspiration and learned flexible programming usage, because it gives more practical examples and conveys an important concept: use it in daily life.
Additionally, as a fresh learner, I also have some suggestions and questions about this chapter.
>>> b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array
>>> print(b.shape)
(2, 3)
>>> print(b[0, 0], b[0, 1], b[1, 0])
1 2 4
>>> import numpy as np
>>> a = np.array([1, 2, 3]) # Create a rank 1 array
>>> print(a[0], a[1], a[2]) # index elements
Thus, I think there are two effective ways.
I have to admit that this chapter is really more difficult, but also useful. It helps me build logical thinking for scrapers and see how to achieve one in practice step by step. The feedback and questions are listed below in chapter order.
1. In Get data:
(1) You seem to have lost the underscore in my_title = myh1.text. Without the underscore '_', this code will raise a NameError.
(2) Still in this code, I cannot understand what type(myh1) means. Maybe we can change myh1 to h2, such as my_h1 = data.find('h2'), and get the output '話癆特朗普', which might help others understand the goal of using tags and attributes to extract the data we want directly.
2. In Get author try 2
(1) How do we determine the tag_name? Just like the 'tr', I'm wondering about the rules, because you use 'a' as tag_name in the later function scrape_articles_urls_of_one_page.
(2) This code looks like a dict: attrs={'class':"post__authors"}. Why use this format? Could you explain it in more detail, or is it just a syntax rule?
Thanks for all your work and help, it's meaningful!
The outputs in Example4 should contain no parentheses and no quotation marks.
Below is the right version:
x=4
y=6
print('x!=y:',x!=y)
x!=y: True
Remember to add parentheses. list2[1:5] means slicing list2 from index 1 up to, but not including, index 5. Likewise, list2[:2] means slicing list2 from index 0 up to, but not including, index 2. We had better explain the rule of slicing lists more clearly. The outputs confused me when I first saw them.
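A short illustration of the slicing rule, using a made-up list so that each value shows its own position:

```python
list2 = ['a', 'b', 'c', 'd', 'e', 'f']   # indexes 0 to 5

# start at index 1, stop before index 5
middle = list2[1:5]
print(middle)

# no start given: begin at index 0, stop before index 2
first_two = list2[:2]
print(first_two)
```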
In Example10, there is no key named 'Frank' and no corresponding value.
If we try to access the value of a key that does not exist in the dict, an error will be reported as follows:
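For illustration only, with a made-up dict rather than the one in Example10: square brackets raise KeyError for a missing key, while get() and the in operator let us check without an error.

```python
ages = {'Alice': 20, 'Bob': 22}   # hypothetical data

try:
    ages['Frank']
except KeyError as error:
    print('KeyError:', error)

# get() returns None (or a default we choose) instead of raising
missing = ages.get('Frank')
print(missing)

# the in operator tests whether a key exists
has_frank = 'Frank' in ages
print(has_frank)
```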
In Example22, the last line (i=i+1) should be indented only one level; otherwise the output will be 1 endlessly and the loop never breaks:
Below is the right version:
In the first line, input('please input a int:') produces a string, and the type of a string is never equal to the type of 1. Hence, ValueError will always be raised whether you enter 2 or 2.2, because both inputs are strings. Therefore, you need to wrap the input() in int(): int('2.2') leads to a ValueError while int('2') produces an int.
Also, in the last line, remember to add parentheses: print(inputValue)
The right version should be:
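The idea can be sketched as follows. This is not necessarily the tutorial's version; `parse_int` is a helper name I made up, and it takes the text as an argument so it can be tried without typing anything.

```python
def parse_int(text):
    """Return the text converted to an int, or None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None


print(parse_int('2'))     # a string holding a whole number
print(parse_int('2.2'))   # int() cannot convert this string

# With keyboard input it could be used like this:
# inputValue = parse_int(input('please input a int:'))
# print(inputValue)
```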
For the example in week5 which extracts all article urls in http://initiumlab.com, all the tags containing the article urls have been collected as below
Next, we need to get the strings after 'href' (e.g. 'href="../blog/20170113-Sharing-With-Friends-Versus-Strangers/"'), and the code is as below:
for my_url in my_urls:
    url ='{0}{1}'.format('http://initiumlab.com',my_url['href'][2:]) #format urls
    #print(url)
    article_urls.append(url)
article_urls
However, I do not quite understand what my_url['href'][2:] means here. my_url['href'] seems to look up a value by its key, as in a dictionary, but my_url is a list element. [2:] seems to extract part of a string. I feel a little bit confused.
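A sketch of the two steps with plain Python objects. As far as I understand, each element of my_urls is a BeautifulSoup tag, and a tag lets you read its attributes with the same square-bracket syntax as a dict, so a plain dict stands in for the tag here.

```python
# a dict standing in for one <a> tag (assumption for illustration)
my_url = {'href': '../blog/20170113-Sharing-With-Friends-Versus-Strangers/'}

# step 1: look up the attribute value, which is a string
href = my_url['href']

# step 2: slice the string from index 2 to drop the leading '..'
path = href[2:]
print(path)

url = '{0}{1}'.format('http://initiumlab.com', path)
print(url)
```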
week00- gh desktop
How do I create a file in the "example" repository after creating a new repo?
I don't understand the "drag" part. Thank you!
It seems that this week has a large load of content! Just some humble suggestions:
In the Str Comparison part we give an example: Name1 == Name3 returns the bool value True. Is it necessary to explain what "equal" really means? If we try "Name1 is Name3", the result would be False. So '==' only means the two items have equal content; it doesn't mean they are the same object.
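The difference can be shown with two lists. Lists are used in this sketch because the result of "is" on strings can vary, since Python may reuse the same object for equal strings.

```python
list_a = ['Chico', 'Ivy']
list_b = ['Chico', 'Ivy']   # same content, but a separate object
list_c = list_a             # another name for the same object

print(list_a == list_b)   # compares content
print(list_a is list_b)   # compares identity
print(list_a is list_c)
```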
in the List[] part:
A little bit confused about "remove()" and "pop()"... What are the differences? Could you explain in a few additional lines?
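A short sketch of the difference, with a made-up list: remove() deletes by value and returns nothing, while pop() deletes by position and returns the removed item.

```python
names = ['Chico', 'Ivy', 'Ri', 'Ivy']

# remove(value): deletes the first item equal to the value
names.remove('Ivy')
print(names)

# pop(): deletes the last item and returns it
last = names.pop()
print(last, names)

# pop(index): deletes the item at that position and returns it
first = names.pop(0)
print(first, names)
```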
in the Dict{} part:
seq = ['Chico', 'Ivy', 'Ri']
dict = dict.fromkeys(seq) #fromkeys()
print("New_dict : %s" % str(dict))
New_dict : {'Chico': None, 'Ivy': None, 'Ri': None}
#Why is seq converted to a dict? Is that done by fromkeys()?
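A sketch of what fromkeys() does: it builds a new dict whose keys come from the sequence, and every key gets the same value (None unless another value is given). The original list is left unchanged. The variable names are my own, chosen to avoid reusing the built-in name dict.

```python
seq = ['Chico', 'Ivy', 'Ri']

new_dict = dict.fromkeys(seq)       # every value is None
zero_dict = dict.fromkeys(seq, 0)   # every value is 0

print(new_dict)
print(zero_dict)
print(seq)                          # still a list
```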
for i in range(1,11):
    print(i**2)
#Should we add a footnote that i**2 can also be written as (i * i)? Maybe that would be clearer for students.
if number_of_users <= 50000:
... cost = 10000
Can we add a friendly reminder that the indentation defines which statements are under the control of the "if"? It is easy for new learners to make mistakes here.
The range() function returns a sequence of numbers, starting from 0 by default, and increments by 1 (by default).(1,10) means values from 1 to 11 (but not including 6)
#Why "not including 6" here? Could you explain a little bit?
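For reference, a quick check of what range() actually produces: the start value is included and the stop value is excluded.

```python
one_to_nine = list(range(1, 10))    # starts at 1, stops before 10
print(one_to_nine)

from_zero = list(range(5))          # start defaults to 0
print(from_zero)

every_second = list(range(1, 10, 2))  # third argument is the step
print(every_second)
```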
i = 1
while i < 9:
    print(i)
    if i == 5:
        break
    i = i + 1 # Why is this statement placed after the 'break'? As a new learner, I would think the flow goes from top to bottom and stops when it encounters the 'break', so this may need some explanation.
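One way to see the flow, as a sketch: the break only runs when i == 5 is true. In every other round the if block is skipped and the next statement at the loop's indentation level, i = i + 1, runs. The list named printed is my own addition to record what would be printed.

```python
printed = []
i = 1
while i < 9:
    printed.append(i)    # stands in for print(i)
    if i == 5:
        break            # leaves the loop immediately, only when i is 5
    i = i + 1            # runs in every round where the break did not happen

print(printed)
print(i)
```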
return "My name is {self.name}, and I'm {self.age} years old".format(self=self)
@hupili Can you add some content on instance variables and class variables? I think the "class" part has not been fully developed so it may be confusing.
Also, in the final example, it would be better to explain the relationship between the class (Account) and its instance methods (deposit and withdraw), and how to call these functions.
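A rough sketch of the distinction being asked about. This Account class is my own simplified stand-in, not the one in the tutorial, and interest_rate is a hypothetical attribute: a class variable is shared by all instances, while instance variables such as balance belong to one object, and instance methods are called on an object.

```python
class Account:
    interest_rate = 0.02             # class variable, shared by every account

    def __init__(self, owner, balance=0):
        self.owner = owner           # instance variables, one set per account
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError('insufficient funds')
        self.balance -= amount
        return self.balance


account_a = Account('Ivy', 100)
account_b = Account('Ri')

account_a.deposit(50)                # instance methods are called on an object
account_a.withdraw(30)
account_b.deposit(10)

Account.interest_rate = 0.03         # changing the class variable affects both
print(account_a.balance, account_b.balance)
print(account_a.interest_rate, account_b.interest_rate)
```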
Thanks for your hard work!
redirected from: Lsn-cecilia/homework2#1
@ChicoXYC you can use the recent update to the code repo as an example.
After adding back this TODO, I'll announce the further challenges/ exercises in week 00
This chapter really surprised me because it is so rich, full of knowledge points about statements, expressions, operators, functions, modules, methods, etc. Even though I had learned part of the same fundamentals through a series of simple exercises before reading Ch3, it was still tough work to finish all the practice and check through this chapter.
In some parts, one or two sentences describing how the knowledge point being introduced serves real programming in practice would help. In other words, one or two sentences to explain its meaning.
Since this chapter is so substantial, fresh learners may feel their brains are about to explode if they learn it all at once, so I think it would be better to separate this chapter into three or four small sections by using, for example, emphasis fonts.
In the part of "List methods" and Example 7:
More description of the while loop, and probably a comparison between the if statement and the while loop, so that students can better understand what each does.
The examples need some hints. For Examples 15 & 16, students may be confused about what the goal of writing these while loops is.
Fix two small bugs:
cost = 10000 + 0.1 * (number_of_users - 50000)
"The actual number of users we have now is 120,000." should be blended into the question part with a grey font, which will be easier for learners to read.
import json
filename = 'population.json'
with open(filename) as f:
    pop_data = json.load(f)
for pop_dict in pop_data:
    if pop_dict['Year'] == '2011':
        country_name = pop_dict['Country Name']
        population = pop_dict['Value']
        print(country_name + ": " + population)
[
    {
        "Country Code": "TZA",
        "Country Name": "Tanzania",
        "Value": 47570902.0,
        "Year": 2011
    },
    {
        "Country Code": "TZA",
        "Country Name": "Tanzania",
        "Value": 49082997.0,
        "Year": 2012
    }
]
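With data shaped like the sample above, "Year" is a number and "Value" is a float, so comparing with the string '2011' never matches and adding a float to a string would fail. The following sketch handles both; it loads the sample from an inline string instead of population.json so it can be run on its own.

```python
import json

# the two sample records from above, as an inline JSON string
raw = '''
[
    {"Country Code": "TZA", "Country Name": "Tanzania", "Value": 47570902.0, "Year": 2011},
    {"Country Code": "TZA", "Country Name": "Tanzania", "Value": 49082997.0, "Year": 2012}
]
'''
pop_data = json.loads(raw)

lines = []
for pop_dict in pop_data:
    # int() makes the comparison work whether Year is 2011 or '2011'
    if int(pop_dict['Year']) == 2011:
        country_name = pop_dict['Country Name']
        population = int(pop_dict['Value'])
        # format() converts the number to text for us
        lines.append('{}: {}'.format(country_name, population))

for line in lines:
    print(line)
```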
Example:
Trouble description
The Jupyter tutorial (module-jupyter.md) lacks instructions for Windows 10 users like me, and yesterday it took me almost two hours to figure out how to install a virtual environment and Jupyter Notebook on my laptop, as the commands to input are different from those on Linux. Hence, I have written down the instructions to install a virtual environment and Jupyter. I think it will be helpful for Windows users if the following content can be added to the tutorial.
Instructions
You need to create a virtual environment as well to use Jupyter Notebook if your operating system is Windows. However, the commands you need to input are a little bit different from those on Linux.
Create a folder named 'venv'. You can place it wherever you like, be it disk C, D or E. In this case, I put it on disk D.
Press Windows key+R, which shows you the 'RUN' box. Input cmd and click 'OK'.
Enter python -m venv test to create a virtual environment called 'test'. You can then see the newly created 'test' folder.
Input activate.bat. Now you can see (test) appear in front of the command line prompt, which means you have entered the virtual environment!
Press 'Ctrl+C' twice to quit the Jupyter notebook, and input deactivate to exit the virtual environment.
So next time you need to enter the environment and launch jupyter notebook, here are all of the commands you need to input in cmd.exe, in order.
In Example 6, there are lines following:
test = [0]
print(test,'is',bool(test))
[0] is False
However, in my opinion (and my Python's), the result should be True. Am I right?
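A quick check supports this: a list is falsy only when it is empty, and [0] contains one element, even though that element itself is falsy.

```python
print(bool([0]))   # a list with one element
print(bool([]))    # an empty list
print(bool(0))     # the number zero on its own
```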
@ChicoXYC here are two revisions.
hupili/python-for-data-and-media-communication-gitbook. Not the repo you used to demo gh-pages. Please be consistent in the examples.
Obviously, as we learn further, the difficulty of our open book also increases gradually, especially in this chapter, since there is much knowledge that we have not touched before, such as XPath, Selenium, CSS, etc. I really recommend that other learners look up simple definitions of these terms, which can help us study effectively. Besides, sincere thanks to our TA @ChicoXYC for his explanation. After our discussion, my problems were solved completely, so I share my explanation and feedback below and hope it can help others learn better.
Code: element = browser.find_element_by_name("q")
Q(1): Why is the element ("q") instead of something else?
A(1): When we look for an element, the strategy is to find a unique attribute that locates what we want. In my test, name("q") can be replaced with element = browser.find_element_by_id('lst-ib'), which also works.
Code: browser.execute_script("window.scrollTo(0,1200);")
Q(2): How do we find the exact number, such as 1200, directly?
A(2): Actually, the number cannot be found directly. We have to test several times to find a number that scrolls to the part of the page we need. You can try (0,300) or (0,600); it's fun.
Code: browser.find_elements_by_xpath("//div[@id='summaryList_mixed']//div[@class='summaryBlock']")
Q: How were the two values 'summaryList_mixed' & 'summaryBlock' determined? What is the general pattern?
A: First, we find that the 'headline', 'date' and 'url' we need are all inside 'summaryBlock'. However, since there are 10 such elements, we need the upper-level 'summaryList_mixed' to locate them accurately. Then we can use a for loop to scrape all the data we need.
Code: browser.execute_script('window.scrollTo(0, document.body.scrollHeight/1.5);')
Q: Why is this different from the code above, browser.execute_script("window.scrollTo(0,1200);")?
A: In this code, we cannot use a fixed number to position the page, as there are 10 pages (or more in the future) to scrape. Each page has a different length, because the abstracts and titles differ in length. Thus, window.scrollTo(0, document.body.scrollHeight/1.5) scrolls down to a position proportional to the page height (the total height divided by 1.5, about two thirds of the way down), which helps us click the 'next' button on every page.
Maybe you can describe how to find a path on Mac before the Navigating part, which might help others work efficiently rather than spending time figuring out how to find a path.
That's all my feedback. You are welcome to discuss it or point out my mistakes. I'm afraid that a wrong understanding on my part could mislead other learners.
Tick when resolved:
$? The username / path before $? The user's input and the shell's output? Think of the questions that first-time readers would ask. Use this screenshot, plus some annotations, to explain in detail.
Use the following thread for discussion.
I realised that one difficulty many groups encountered last time was being unable to crawl some websites/ data sources they intended to. While it is impossible to enumerate all potential cases and barriers, I have decided to make more examples. @ChicoXYC please collect the crawling ideas from our past students that:
I will evaluate those ideas and write sample code for the general issues.