An open source book on Python tailored for communication students with zero background

python journalism marketing communication data data-analysis data-visualization

python-for-data-and-media-communication-gitbook's Issues

question about array

In the image, why is the highlighted part '1,2,4' the result of print? How is that result produced?
Also, I don't quite understand 'Similarly, the first number is to index elements in this array, the second number is to index the sub-elements in each elements'. Could anyone help me out?

word frequency challenge based on ch3, and "in" operator of dict and list

New challenge is added: https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-03.md#simple-word-frequency

@ChicoXYC, I find that the `in` operator for dict is missing. Please add it to the corresponding section.

You are welcome to paste your answers below!
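For anyone attempting the challenge, here is a minimal sketch of one possible answer, not necessarily the one in the notes. The sample sentence and the variable names are made up for illustration. It also shows that `in` checks the keys of a dict and the elements of a list.

```python
# Hypothetical sample text; replace it with the text from the challenge
text = "the cat and the hat and the bat"

word_count = {}
for word in text.split():
    if word in word_count:      # `in` on a dict checks the keys
        word_count[word] += 1
    else:
        word_count[word] = 1

print(word_count)
print('the' in word_count)          # True: 'the' is a key
print(3 in word_count)              # False: 3 is a value, not a key
print('cat' in ['cat', 'hat'])      # True: `in` on a list checks the elements
```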

Questions and Feedback for Chapter 04

Feedback

Just partial feedback. I suppose we can add a note to the “csv_reader” part in week 04, because fresh learners may not understand the difference between the virtual environment and the rest of their computer. If they just download ‘name_list.csv’, it may not be found from the terminal or Jupyter Notebook, and there will be a FileNotFoundError. Thus, we can note: “Before you run this code, please place ‘name_list.csv’ in the ‘venv’ folder, because Jupyter Notebook runs inside the virtual environment and reads the file from the ‘venv’ folder.”
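A quick way to check where a file is expected, sketched under the assumption that the CSV is opened with a relative path: relative paths are resolved against the current working directory, which `os.getcwd()` reports. The temporary folder and the file content below are made up so that the sketch runs anywhere.

```python
import csv
import os
import tempfile

# Relative paths such as 'name_list.csv' are resolved against this folder
print(os.getcwd())

# Illustration only: write a tiny hypothetical name_list.csv into a temporary folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'name_list.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows([['name'], ['Alice'], ['Bob']])

# Reading it back works because we pass the full path to the file
with open(path, newline='') as f:
    rows = list(csv.reader(f))
print(rows)

# A path that does not exist raises FileNotFoundError
try:
    open(os.path.join(folder, 'missing.csv'))
    missing_found = True
except FileNotFoundError as e:
    missing_found = False
    print('not found:', e.filename)
```

If the printed working directory is not the folder that holds the CSV, either move the file there or pass its full path to `open()`.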

An alternative way to calculate the break-even point of subscribed users to make profit in Example21 in Week3

Example 21 in the Week3 tutorial is helpful for learning how the for loop works. But I also find that the same result can be achieved with a while loop. Below is the code I wrote, shared for your reference and suggestions. You can also check the results in nbviewer here.

Fixed_Cost = 30000
Content_Cost = 70000
member_ff = 15
convert_rate = 0.1
ad_revenue_each_person = 1
subscribers = 0
# With no subscribers, the net income is just the negative of the costs
net_income = -(Fixed_Cost + Content_Cost)
# Add one subscriber at a time until the net income is no longer negative.
# Comparing with < 0 instead of == 0 avoids an endless loop if the
# floating-point result never lands exactly on 0.
while net_income < 0:
    subscribers = subscribers + 1
    net_income = convert_rate*subscribers*member_ff + ad_revenue_each_person*subscribers - Fixed_Cost - Content_Cost
    if subscribers >= 50000:
        # extra cost of 0.1 for every subscriber beyond 50000
        net_income = net_income - 0.1*(subscribers - 50000)
print(subscribers)

issues about installing jupyter in CVA517

Here are the instructions for installing Jupyter. Generally, we type the following commands to create a virtual environment and then install all dependencies and modules. In the lab, however, there are some problems:

pyvenv venv
source venv/bin/activate
pip3 install jupyter

You will get an error asking you to upgrade your pip version while installing Jupyter. Just copy the command it suggests:

pip install --upgrade pip

Then install Jupyter and open Jupyter Notebook:

pip3 install jupyter
jupyter notebook

Then you will encounter another problem: whatever you execute, the cell stays in the running state and a message says "kernel starting, please wait...". The reason can be found in the terminal: when you install Jupyter, there is a red line saying:

ipython 7.0.1 has requirement prompt-toolkit<2.1.0,>=2.0.0, but you'll have prompt-toolkit 1.0.15 which is incompatible.

In order to solve this, there are two solutions, which can be found in jupyter_kernel issue#158:

solution 1: pip install 'ipykernel<5.0.0'

solution 2: first downgrade ipython with pip install -U ipython==6.5.0, then prompt-toolkit with pip install -U prompt-toolkit==1.0.15

Finally, open jupyter notebook, type something to test, it should work now!

Change file naming convention

Here is the file naming convention:

https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/guide-for-contributor.md#file-naming-convention

Please do a quick scan in the assets folder and change the file names and their corresponding references in the markdown files.

Broken paths after moving the images

one example:

https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/module-python-twitter.md

Minor Problems in Week5

It would be better to explain to students that Jupyter Notebook does not display a result on the screen unless you call print() or put the variable on a line of its own at the end of the cell. Not mentioning this may cause confusion.
my_urls = data.find_all('a',attrs={'class':'post__title-link js-read-more'})
I am wondering why the value here is 'post__title-link js-read-more'. When I inspect the page, I only see this:
「消失的檔案」：香港開放數據大檢討

Why is it not this one: class="post__title-link"?
One thing I cannot clearly understand is the formatting part:
my_url['href'].split('/blog')[-1]
Could you explain a little how this returns the format we want?
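A minimal sketch of what `split('/blog')[-1]` does, using a plain string in place of `my_url['href']`. The href value is an assumption, modeled on the article links of the site being scraped.

```python
# Hypothetical href value, modeled on the article links of the site
href = '../blog/20170113-Sharing-With-Friends-Versus-Strangers/'

parts = href.split('/blog')   # cut the string wherever '/blog' occurs
print(parts)                  # ['..', '/20170113-Sharing-With-Friends-Versus-Strangers/']
print(parts[-1])              # [-1] picks the last piece, i.e. everything after '/blog'
```

So the expression keeps only the part of the link that follows `/blog`, whatever comes before it.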

for i in range(1,8): #format all pages urls
    if i == 1:
        page_url = url
    else:
        page_url = '{url_initial}page/{number}/'.format(url_initial = url,number=i)
    #print(page_url)
The if logic here says that if the page is 1, page_url is just url; otherwise it needs further formatting. My question is why we cannot apply the same formatting to every page. Don't they share the same format?
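A small sketch of what the loop produces, assuming (as the `if` branch suggests) that the first page lives at the base URL while later pages live under `page/N/`. The base URL below is a placeholder, not the real site.

```python
url = 'http://example.com/blog/'   # placeholder base URL, an assumption for illustration

page_urls = []
for i in range(1, 8):
    if i == 1:
        page_url = url    # first page: the base URL itself
    else:
        page_url = '{url_initial}page/{number}/'.format(url_initial=url, number=i)
    page_urls.append(page_url)

print(page_urls[0])     # http://example.com/blog/
print(page_urls[1])     # http://example.com/blog/page/2/
print(len(page_urls))   # 7
```

Whether `page/1/` would also work depends on the site. If it does, the `if` branch is unnecessary and one format string covers every page.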

Thanks for your hard work! Yes, this week is more difficult than before, but I think it is clear to follow. :)

A question about week-06

Week-06

This part of week-06 shows "by_link_text". But it only works when I change it to "by_partial_link_text".

I am confused about when we should use "partial".

CH05 Feedbacks

Operating system:windows
Python version:3
Hardware:
Internet access:Y
Jupyter notebook or not? Y
Which chapter of book?:05

Feedback1:
I got an error when trying to install BeautifulSoup using the first command mentioned, as follows

The output is as follows:

Have you searched Google/ Stack Overflow for anything? Yes

Do they solve or partially solve your question? Yes, but in a different way

Scraper yields different results upon execution

Troubleshooting

Describe your environment

Operating system:
Python version:
Hardware:
Internet access:
Jupyter notebook or not? [Y/N]: Y
Which chapter of book?: ch5

Describe your question

Following are some permalinks for codes and data.

https://github.com/zhangjingwei0512/chapter5/blob/8981ed72e972d0a458adbbc6898f5def55cef9ce/further2.py

https://github.com/zhangjingwei0512/chapter5/blob/8981ed72e972d0a458adbbc6898f5def55cef9ce/further2.csv

https://github.com/zhangjingwei0512/chapter5/blob/8981ed72e972d0a458adbbc6898f5def55cef9ce/further2(2).csv

The minimum code (snippet) to reproduce the issue

a little question

in the Example 13:

test3 = "python loves,'you'"
test3.find('you')
8 #returns the first character where 'you' begins

Should that be 8? I think it should be 14 instead.
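A quick check, assuming the string is exactly `python loves,'you'` (written with double quotes on the outside so that the inner single quotes are legal):

```python
test3 = "python loves,'you'"

position = test3.find('you')
print(position)                         # 14
print(test3[position:position + 3])     # you
print(test3.find('cat'))                # -1 when the substring is not found
```

Counting from 0, `python` takes indexes 0 to 5, the space is 6, `loves` takes 7 to 11, the comma is 12 and the quote is 13, so `you` starts at 14.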

Chapter 4 Working Thread - API/ JSON/ CSV

Chapter 9 Working thread - data viz, presentation

@hupili - structure.

bar chart via:

end-to-end workflow on gh-pages:

Exercise: Make a first page, simplest, CSS-free, single-column layout. @ChicoXYC
Put an interactive chart into the page.

random module in ch2

Address TODO comments in ch2's random module section.

Chapter 7 Working thread - pandas basics

@hupili outline.
@ChicoXYC prepare an updated openrice dataset of similar format. High Priority. Please store the scraping script and dataset in the scraper-example folders. Create new files and don't override previous openrice example.
@ChicoXYC include the URL and POI ID of each restaurant into the 20K dataset.
@ChicoXYC , address the TODO notes for pandas basics. Note that there is a bit of duplication between ch7 and ch8 (the 1D part). This is normal because most of the time you observe errors during analysis. So analysis skills can also help data cleaning.
@hupili, @ChicoXYC , inject some errors into the openrice dataset for data cleaning exercise.
@ChicoXYC First review and smoothen the content up to dataprep section

I have carefully gone through Chapter 2. The content is easy to understand, and I think the exercises are really helpful for newcomers to master the basics of Python. Thank you for your hard work! :)

Troubleshooting

Describe your environment

Operating system:
Python version:
Hardware:
Internet access:
Jupyter notebook or not? [Y/N]:
Which chapter of book?:

Describe your question

Example: I get IOError when running my script to load files.

The minimum code (snippet) to reproduce the issue

Example:

open('path-to-a-file-not-exist')

Describe the efforts you have spent on this issue

Example:

Have you searched Google/ Stack Overflow for anything?

Do they solve or partially solve your question?

What is the closest answer you can find?

list index out of range

Troubleshooting

I wrote this program to allocate one case to every five students. But something seems to go wrong with the index on the lines " print(list2[s])" and "print(list1[c:c+5])". How can I start s and c at 0 and change their values on each iteration?

student_list =[
18421111,
18421112,
18421113,
18421114,
18421115,
18421116,
18421117,
18421118,
18421119,
18421120,
18421121,
18421122,
18421123,
18421124,
18421125,
18421126,
18421127,
18421128,
18421129,
18421130,
18421131,
18421132,
18421133,
18421134,
18421135,
18421136,
18421137,
18421138,
18421139,
18421140,
18421141,
18421142,
18421143,
18421144,
18421145,
18421146,
18421147,
18421148,
18421149,
18421150,
18421151,
18421152,
18421153,
18421154,
18421155,
18421156,
18421157,
18421158,
18421159,
18421160,
]
case_list =[
'case1 - build a calculator to evaluate your business model',
'case2 - build a automatic earthquake robot to broadcast the new earthquake',
'case3 - evaluate social media performance of a luxury brand',
"case4 - study movie blockbuster 'Dying to Survive'",
'case5 - invest your money like the Internet giant, Tencent',
'case6 - where are the 200,000 inferior vaccines flowing?',
"case7 - study classics, Who control the discourse power in 'Dream of the Red Chamber'",
'case8 - research about Didi-driver crimes in China',
"case9 - 'Me too' analysis",
'case10 - what is hip-hop in china?'
]

import random
random.shuffle(student_list)
list1=student_list
print(list1)
random.shuffle(case_list)
list2=case_list
print(list2)

s=0
c=0
for s in student_list:
    print(list2[s])
    s=s+1
for c in case_list:
    print(list1[c:c+5])
    c=c+5
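One possible fix, sketched with shortened stand-in lists. In `for s in student_list`, `s` is a student ID (such as 18421111), not a position, so `list2[s]` is far out of range. Looping over positions with `range(len(...))` avoids that. The `allocation` dict is a name introduced here for illustration.

```python
import random

# Shortened stand-ins for the lists above: 50 student IDs and 10 case names
student_list = list(range(18421111, 18421161))
case_list = ['case{}'.format(n) for n in range(1, 11)]

random.shuffle(student_list)
random.shuffle(case_list)

allocation = {}
for s in range(len(case_list)):      # s is a position: 0, 1, ..., 9
    c = s * 5                        # start of this case's group of five students
    allocation[case_list[s]] = student_list[c:c + 5]

for case, students in allocation.items():
    print(case, students)
```

Because `c` is computed from `s`, there is no need to set `s=0` and `c=0` beforehand or to increase them by hand inside the loop.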

Chapter 0 first version content

Python linting and debugging

Python is a dynamic language, which makes error checking at compile time hard. Most errors are exposed at run time. However, some tools can still help us catch most of the errors and write code that follows best practices. "Linting" is a general concept found in all programming languages; it refers to the process of identifying potential errors and suboptimal practices while the code is being written.

The step to get "twitter API" (week04)

Which chapter of book:week04

Preparation: Twitter account (using a Gmail account to sign up is highly recommended!)

Step 1

Go to https://apps.twitter.com/app/new and click 'Apply for a developer account'.

Step 2

Choose “Personal use” and enter your desired application name, primary country (hk) and so on.
The most important part is: Describe in your own words what you are building

You have to answer each question in as much detail as possible. The following is an example:

1. I’m using Twitter’s APIs to practice my data collection and analysis skills in the Big Data Analysis course. I am currently a postgraduate student at Hong Kong Baptist University, majoring in Communication, and I normally collect user comment data for mass communication research.
2. As for the methods and techniques I plan to use, here is our course GitHub open book. You can take it as a reference. https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-04.md#use-api-via-function-calls-to-other-modules-packages
3. I only use it for data collection practice and will not use it for tweeting, retweeting or liking content.
4. Tweets will be displayed in our final project presentation for academic use.
5. Finally, I will comply with the policies of Twitter.
I am looking forward to your favorable reply.

Then submit your application and verify your email.

After that, your application may be put under review, or you may jump to a new page:

Chapter 6 Working thread - advanced scraping

@ChicoXYC Browser simulation part - selenium
@ChicoXYC splinter
@ChicoXYC twitter example (using browser simulation, not API)
@hupili analyse network trace
@ChicoXYC , @hupili rate throttling example
@hupili user agent
@ChicoXYC @hupili common issues
@hupili mobile
@hupili quick scraping pointers

Install Python3 on Windows and Set Environment

Click here to download Python 3.7 (64-bit).
If you need to install another version of Python, click here and go to the hyperlink provided.
Remember to choose 'Customize installation' and tick 'pip' when you are installing Python. If you install with the default options, you cannot import the 'numpy' module in week2.

Connor's Feedback&Question on Chapter 2

Feedback

This chapter is really wonderful. Although I already have a bit of fundamental knowledge of Python from other channels, I still got new inspiration and learned flexible programming usage, because it gives more practical examples and conveys an important concept: use it in daily life.

Additionally, as a fresh learner, I have some suggestions and questions about this chapter.

Suggestions & Questions

Questions
In "Basic functions: Arrays", I tried my best to understand the definition of "shape", but I'm still confused by it, especially in the example below. I hope to receive a more detailed explanation of how it works. Thanks!

>>> b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
>>> print(b.shape)
(2, 3)
>>> print(b[0, 0], b[0, 1], b[1, 0])
1 2 4
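As a rough analogy, the same idea can be sketched with a plain nested list, so that it runs without NumPy. The nested list and the `shape` tuple computed by hand are stand-ins for the NumPy array and its `.shape` attribute.

```python
# Plain-Python stand-in for np.array([[1,2,3],[4,5,6]]): a list of two rows
b = [[1, 2, 3],
     [4, 5, 6]]

shape = (len(b), len(b[0]))   # (number of rows, number of items in each row)
print(shape)                  # (2, 3)

# NumPy's b[0, 1] corresponds to b[0][1] here: pick row 0 first, then item 1 in that row
print(b[0][0], b[0][1], b[1][0])   # 1 2 4
```

So `(2, 3)` reads as "2 rows, each with 3 items", and `b[1, 0]` is the first item of the second row, which is 4.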

Suggestions
The example below involves the concept of an index. However, there is no introduction to indexing before this example, which may cause difficulties for learners with zero background.

>>> import numpy as np
>>> a = np.array([1, 2, 3])   # Create a rank 1 array
>>> print(a[0], a[1], a[2]) # index elements

Thus, I think there are two effective options:

Add a link to the 6th example of Chapter 3, which explains it exactly.
Give a simple explanation of the index.

Feedback & Questions in Week5

I have to admit that this chapter is more difficult, but it is also useful: it helps me build the logical thinking behind a scraper and see how to implement it in practice step by step. My feedback and questions are listed below in chapter order.

1. In Get data:

(1) You seem to have lost the underscore in my_title = myh1.text. Without the underscore '_', this code raises a NameError.

(2) Still in this code, I cannot understand what Type(myh1) means. Maybe we can change the target from h1 to h2, such as my_h1 = data.find('h2'), and get the output '話癆特朗普'. This might help others understand the goal: using tags and attributes to extract the data we want directly.

2. In Get author try 2

(1) How do we determine the tag_name, such as 'tr'? I'm wondering about the rule, because you use 'a' as the tag_name in the later function scrape_articles_urls_of_one_page.

(2) This code looks like a dict: attrs={'class':"post__authors"}. Why use this format? Could you explain it in more detail, or is it just a syntax rule?

Thanks for all your work and help, it's meaningful !

Week 3 Feedback

The outputs in Example4 should contain no parentheses and no quotation marks.

Below is the right version:

x=4
y=6
print('x!=y:',x!=y)
x!=y: True
Remember to add parentheses. list2[1:5] means slicing list2 from index 1 up to, but not including, index 5. Likewise, list2[:2] means slicing list2 from index 0 up to, but not including, index 2. We had better explain the rule of slicing lists more clearly; it confused me when I first saw the outputs.
Delete the command of the third line in example8.
In Example10, there is no key named 'Frank' or its corresponding value.

If we try to access the value of a key that does not exist in the dict, an error will be reported as follows:
In Example22, the last line (i=i+1) should have only one level of indentation; otherwise the output will be 1 endlessly and the loop never breaks:

Below is the right version:
In the first line, input('please input a int:') returns a string, and the type of a string is never equal to the type of 1. Hence, ValueError will always be raised whether you enter 2 or 2.2, because both inputs are strings. Therefore, you need to wrap the input() call in int(): int('2.2') raises a ValueError while int('2') produces an int.

Also, in the last line, remember to add parentheses: print(inputValue)
The right version should be:
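The original snippet is not preserved here, so the following is only a sketch of what such a version might look like. `read_int` is a hypothetical helper name, and fixed strings stand in for what a user would type into input().

```python
def read_int(text):
    """Convert the string returned by input() to an int, or report why it fails."""
    try:
        return int(text)
    except ValueError:
        print('please input an int, got:', text)
        return None

# input() always returns a string, so these strings stand in for user input
print(read_int('2'))     # 2
print(read_int('2.2'))   # prints the message above, then None

# In the notebook: inputValue = read_int(input('please input a int:'))
```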

A question about the codes to extract article url in week5 - [improve: list slicing]

For the example in week5 which extracts all article urls in http://initiumlab.com, all the tags including the article urls have been collected like below

Next, we need to get the strings after 'href' (e.g.'href="../blog/20170113-Sharing-With-Friends-Versus-Strangers/"'), and the codes are as below:

for my_url in my_urls:
    url ='{0}{1}'.format('http://initiumlab.com',my_url['href'][2:]) #format urls
        #print(url)
    article_urls.append(url)
article_urls

However, I do not quite understand what my_url['href'][2:] means here. my_url['href'] looks like looking up a value by its key in a dictionary, but my_url is a list element. [2:] looks like extracting part of a string. I feel a little confused.
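Both readings are right, and they are applied one after the other. Each element of `my_urls` is a BeautifulSoup tag, and a tag gives access to its attributes with the same square-bracket syntax as a dict. In the sketch below a plain dict stands in for the tag, which is an assumption made so that the code runs without BeautifulSoup.

```python
# Stand-in for one element of my_urls; a real tag supports my_url['href'] in the same way
my_url = {'href': '../blog/20170113-Sharing-With-Friends-Versus-Strangers/'}

href = my_url['href']    # step 1: look up the attribute value, a plain string
print(href[2:])          # step 2: slice the string, dropping the first two characters ('..')

url = '{0}{1}'.format('http://initiumlab.com', href[2:])
print(url)               # http://initiumlab.com/blog/20170113-Sharing-With-Friends-Versus-Strangers/
```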

Chapter 15 Working Thread - machine learning primer.

CH06 Feedbacks - the correct notation (% and !) to run shell commands in Jupyter notebook

%curl https://nghttp2.org/httpbin/user-agent
{"user-agent":"curl/7.54.0"}

For this piece of code, the output shows:

It seems that the magic function doesn't work here

How to create a file in "example" repository

Troubleshooting

Describe your environment

Operating system:mac os
Python version:
Hardware:
Internet access:
Jupyter notebook or not? [Y/N]:
Which chapter of book?:week00

Describe your question

week00- gh desktop
How do I create a file in the "example" repository after creating a new repo?
I don't understand the "drag" part. Thank you!

Feedback for Chapter 3

It seems that this week has a large load of content! Just some humble suggestions:

In the Str Comparison part we give an example: Name1 == Name3 returns the bool value True. Is it necessary to explain what "equal" really means? If we try "Name1 is Name3", the result would be False. So '==' only means that the two items have equal content; it doesn't mean they are the same object.
In the List[] part:
I am a little confused about "remove()" and "pop()". What are the differences? Could you explain in a few additional lines?
In the Dict{} part:

I noticed that we talk about the index several times, as in list.insert(i,x), pop(i)... Maybe it's better to explain the index more, ideally with a diagram that shows how the 'identity numbers' match the items, like {1, 2, 3, 4}
[0][1][2][3]
A typing error: unorder --> disorder
Sorry for my weak understanding... It seems that the str() output is no different from the original one (Example10). So why do we use str()? It would be better to illustrate this.
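Two of the points above can be sketched in a few lines. The lists and names are made up for illustration. Lists are used for the `==` versus `is` comparison because the result of `is` on strings can depend on how Python stores them.

```python
# == compares content, `is` checks whether two names refer to the same object
a = [1, 2, 3]
b = [1, 2, 3]
c = a
print(a == b)   # True: same content
print(a is b)   # False: two separate list objects
print(a is c)   # True: c is just another name for the same object

# remove() deletes by value, pop() deletes by position and returns the item
names = ['Chico', 'Ivy', 'Ri', 'Ivy']
names.remove('Ivy')      # removes the first 'Ivy'
print(names)             # ['Chico', 'Ri', 'Ivy']
last = names.pop()       # no index: removes and returns the last item
print(last, names)       # Ivy ['Chico', 'Ri']
first = names.pop(0)     # index 0: removes and returns the first item
print(first, names)      # Chico ['Ri']
```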

A tiny question:

seq = ['Chico', 'Ivy', 'Ri']
dict = dict.fromkeys(seq) #fromkeys()
print("New_dict : %s" % str(dict))
New_dict : {'Chico': None, 'Ivy': None, 'Ri': None}

#Why is seq converted into a dict? Is that done by fromkeys()?
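A short sketch of what `dict.fromkeys()` does. The variable is named `new_dict` here, an illustrative choice, so that the built-in name `dict` is not overwritten.

```python
seq = ['Chico', 'Ivy', 'Ri']

new_dict = dict.fromkeys(seq)     # every item of seq becomes a key; the value defaults to None
print(new_dict)                   # {'Chico': None, 'Ivy': None, 'Ri': None}
print(seq)                        # ['Chico', 'Ivy', 'Ri'] (seq itself is unchanged)

zeros = dict.fromkeys(seq, 0)     # an optional second argument sets the value for every key
print(zeros)                      # {'Chico': 0, 'Ivy': 0, 'Ri': 0}
```

So `seq` is not converted. `fromkeys()` builds a new dict and uses the items of `seq` as its keys.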

for i in range(1,11):
    print(i**2)
#Should we add a footnote that i**2 can also be written as (i * i)? Maybe that will be clearer for students.

if number_of_users <= 50000:
... cost = 10000

Can we add a friendly reminder that the indentation defines which statements are under the control of the "if"? It's so easy for new learners to make mistakes here.

The range() function returns a sequence of numbers, starting from 0 by default, and increments by 1 (by default).(1,10) means values from 1 to 11 (but not including 6)
#Why is 6 not included here? Could you explain a little?
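The rule can be checked directly: `range()` starts at the first number and stops just before the second one. The variable names below are only for illustration.

```python
first_five = list(range(5))        # one argument: start at 0, stop before 5
one_to_nine = list(range(1, 10))   # start at 1, stop before 10
one_to_five = list(range(1, 6))    # start at 1, stop before 6

print(first_five)    # [0, 1, 2, 3, 4]
print(one_to_nine)   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(one_to_five)   # [1, 2, 3, 4, 5]
```

So `range(1, 10)` gives 1 to 9, and "not including 6" only applies to `range(1, 6)`.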

i = 1
while i < 9:
    print(i)
    if i == 5:
        break
    i = i + 1 # Why is this statement placed after the 'break'? As a new learner, I would think the flow goes from top to bottom and stops when it meets the 'break', so this may need some explanation.

return "My name is {self.name}, and I'm {self.age} years old".format(self=self)

Can we show another format for readers to better understand?
return "My name is {0}, and I'm {1} years old".format(self.name, self.age)
because I don't understand this part: format(self=self) ...
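The two spellings can be compared side by side. `Person` and the sample values are hypothetical, introduced only so that the sketch runs on its own.

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def intro_named(self):
        # format(self=self) passes the object under the keyword name 'self';
        # {self.name} then reads the attribute `name` from that argument
        return "My name is {self.name}, and I'm {self.age} years old".format(self=self)

    def intro_positional(self):
        # {0} and {1} refer to the first and second positional arguments
        return "My name is {0}, and I'm {1} years old".format(self.name, self.age)

p = Person('Ivy', 23)
print(p.intro_named())        # My name is Ivy, and I'm 23 years old
print(p.intro_positional())   # My name is Ivy, and I'm 23 years old
```

In `format(self=self)`, the left `self` is the name used inside the braces and the right `self` is the object being passed in.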

It is better to tell readers how to call class function. For example, we should call from the class name.

@hupili Can you add some content on instance variables and class variables? I think the "class" part has not been fully developed so it may be confusing.
Also, in the final example, better to explain the relationship between class method (Account) and instance method (deposit and withdraw), and how to call these functions.

Thanks for your hard work!

How to change default branch for GitHub pages?

redirected from: Lsn-cecilia/homework2#1

Chapter 0 - fork, modify, PR

https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-00.md#the-workflow-fork-repo-modify-code-and-send-pull-request

@ChicoXYC you can use the recent update to the code repo as an example.

After adding back this TODO, I'll announce the further challenges/ exercises in week 00

Chapter 8 Working thread - 1D analysis & 2D analysis

@ChicoXYC organise from S18 notes, 1D and 2D analysis.

Hatty's Feedback on Ch3

Hatty's feedback on Chapter 3

Before advice:

This chapter really surprised me because it is so rich, full of knowledge points about statements, expressions, operators, functions, modules, methods, etc. Even though I had learned part of the same fundamentals through a series of simple exercises before reading Ch3, it was still tough work to finish all the practice and check through the whole chapter.

Here are some bullet points I think might be useful for you:

In some parts, it would help to add one or two sentences describing how the knowledge point being introduced serves real programming in practice. In other words, one or two sentences on why it matters.
Since this chapter is so substantial, fresh learners may feel their brains are about to explode if they learn it all at once, so I think it would be better to separate this chapter into three or four small sections, for example by using emphasis fonts.
In the part of "List methods" and Example 7:
- Description: more details and no “s” behind every verb
  Examples: separate examples for diverse methods with spaces.
- Provide some practices behind every part of methods learning.
More description of the while loop, and probably a comparison between the if statement and the while loop, so that students can better learn what each does.
The examples need some hints. For examples 15 & 16, students may be confused about what the goal of writing these while loops is.
Fix two small bugs:
- In the example 13 question, the formula is "cost=1000+0.1×(number_of_users-50000)", while in the answer it is cost = 10000 + 0.1 * (number_of_users - 50000).
- In example 14,

The actual number of users we have now is 120,000.

should be blended into the question part with a grey font, which will be easier for learners to read.

man - useful command to get help

New section: https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-01.md#get-inline-help-in-the-command-line

Chapter 3 Working Thread - Python for anything: condition, loop, function, class

`str` and `int` data type issue in json file

Troubleshooting

Describe your environment

Operating system:os
Python version:3
Hardware:
Internet access:
Jupyter notebook or not? [Y/N]:
Which chapter of book?:4

Nothing is printed out. @hupili

The minimum code (snippet) to reproduce the issue

import json

filename = 'population.json'
with open(filename) as f:
	pop_data = json.load(f)

	for pop_dict in pop_data:
		if pop_dict['Year'] == '2011':
			country_name = pop_dict['Country Name']
			population = pop_dict['Value']
			print(country_name + ": " + population)

[
	{
	"Country Code": "TZA", 
	"Country Name": "Tanzania", 
	"Value": 47570902.0, 
	"Year": 2011
	},
	{
	"Country Code": "TZA", 
	"Country Name": "Tanzania", 
	"Value": 49082997.0, 
	"Year": 2012
	}
]
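A likely cause, sketched with the two records above inlined so that it runs on its own: in this file "Year" is stored as a number, so comparing it with the string '2011' is never true and nothing is printed. The population is also a number, so it needs str() before it can be joined to other strings. The `lines` list is a name introduced for illustration.

```python
import json

# Two records in the same shape as the data above
raw = '''[
  {"Country Code": "TZA", "Country Name": "Tanzania", "Value": 47570902.0, "Year": 2011},
  {"Country Code": "TZA", "Country Name": "Tanzania", "Value": 49082997.0, "Year": 2012}
]'''
pop_data = json.loads(raw)

lines = []
for pop_dict in pop_data:
    if pop_dict['Year'] == 2011:                  # compare with the int 2011, not '2011'
        country_name = pop_dict['Country Name']
        population = int(pop_dict['Value'])       # 47570902.0 becomes 47570902
        lines.append(country_name + ": " + str(population))   # str() before joining

print(lines)   # ['Tanzania: 47570902']
```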

Describe the efforts you have spent on this issue

Example:

Have you searched Google/ Stack Overflow for anything?

Do they solve or partially solve your question?

What is the closest answer you can find?

Complete session F2018 team list

https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/session-F2018.md

Instructions of Installing Jupyter Notebook on Windows

How to install virtual environment and Jupyter Notebook on Windows

Trouble description
The Jupyter tutorial (module-jupyter.md) lacks instructions for Windows 10 users like me. Yesterday it took me almost two hours to figure out how to install a virtual environment and Jupyter Notebook on my laptop, as the commands are different from those on Linux. Hence, I have written down the instructions for installing a virtual environment and Jupyter. I think it would be helpful for Windows users if the following content were added to the tutorial.

Instructions
If your operating system is Windows, you also need to create a virtual environment to use Jupyter Notebook. However, the commands you need to enter are a little different from those on Linux.

1. Create a folder named 'venv'. You can place it wherever you like, be it disk C, D or E. In this case, I put it on disk D.
2. Press Windows key+R, which shows the 'Run' box. Input cmd and click 'OK'.
3. Input D: to change to disk D.
4. Input cd venv to go to the 'venv' folder.
5. Enter python -m venv test to create a virtual environment called 'test'. You can see the newly created 'test' folder.
6. Enter cd test to go to the 'test' folder.
7. Input cd Scripts to go to the 'Scripts' folder.
8. Input activate.bat. Now (test) appears in front of the command line prompt, which means you have entered the virtual environment!
9. Enter pip install jupyter to install Jupyter.
10. Input jupyter notebook to launch Jupyter Notebook.
11. Press 'Ctrl+C' twice to quit Jupyter Notebook, and input deactivate to exit the virtual environment.

Next time you need to enter the environment and launch Jupyter Notebook, here are all the commands you need to input in cmd.exe, in order.

a little question

In Example 6, there are lines following:

test = [0]
print(test,'is',bool(test))
[0] is False

However, in my opinion, and according to my Python, the result should be True. Am I right?
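This is easy to check. A list is False only when it is empty, and `[0]` contains one element, so it is True. The sample values below are chosen for illustration.

```python
samples = [[0], [], 0, '', '0']
results = [bool(x) for x in samples]

for x, r in zip(samples, results):
    print(repr(x), 'is', r)
# [0] is True    (a list with one element, even if that element is 0)
# [] is False    (an empty list)
# 0 is False     (the number zero)
# '' is False    (an empty string)
# '0' is True    (a non-empty string)
```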

Chapter 2 Working thread

Chapter 10 Working Thread - text data

@ChicoXYC - organise the text, time series, graph notes from S18
@ChicoXYC - flight route map example from S18 final project.

Add TOC for all sections

https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/guide-for-contributor.md#table-of-contents-toc

Chapter 5 Working thread - basic scraping (static)

@ChicoXYC crawl the whole initiumlab.com -- need iterate the article list by pagination.

week-00: gh-pages steps

@ChicoXYC here are two revisions.

https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-00.md#step-4-commit--publish . You showed a screenshot that is not part of the steps. The screenshot shows how to commit/ push to the current repo, i.e. hupili/python-for-data-and-media-communication-gitbook, not the repo you used to demo gh-pages. Please be consistent in the examples.
After step 4, please give a screenshot of the live website. It is also good to give the live URL. In this way, people know that it is working.

Feedback for Chapter06

Before Feedback

Obviously, as we learn further, the difficulty of our open book increases gradually, especially in this chapter, since it contains many things we have not touched before, such as XPath, Selenium, CSS, etc. I really recommend that other learners look up simple definitions of these terms first, which helps us study effectively. Besides, sincere thanks to our TA @ChicoXYC for his explanations. After our discussion, my problems were solved completely, so I share my explanations and feedback below and hope they help others learn this chapter better.

1: In Navigating

Code: element = browser.find_element_by_name("q")
Q(1): Why is the element ("q") instead of something else?

A(1): When we look for an element, the strategy is to find a unique element that locates what we want. In my test, name("q") can be replaced with element = browser.find_element_by_id('lst-ib'), which also works.

Code:browser.execute_script("window.scrollTo(0,1200);")
Q(2): How do we find the exact number, such as 1200, directly?

A(2): Actually, the number cannot be found directly. We have to test several times to find the number that scrolls to the part of the page we need. You can try (0,300) or (0,600); it's fun.

2: In CNN articles scraping

Code:browser.find_elements_by_xpath("//div[@id='summaryList_mixed']//div[@class='summaryBlock']")
Q: Why are these two values, 'summaryList_mixed' and 'summaryBlock', chosen? What is the general pattern?

A: First, the 'headline', 'date' and 'url' we need are all inside 'summaryBlock'. However, since there are 10 elements of this kind, we need the upper-level 'summaryList_mixed' to locate them accurately. Then we can use a for loop to scrape all the data we need.

3: In scrape all pages

Code:browser.execute_script('window.scrollTo(0, document.body.scrollHeight/1.5);')
Q: Why is this different from the earlier code browser.execute_script("window.scrollTo(0,1200);")?

A: In this code, we cannot set a fixed number to position the page, as there are 10 pages (or more in the future) that we need to scrape. Each page has a different length, because the abstracts and titles differ in length. Thus, (0, document.body.scrollHeight/1.5) means scrolling down to the position at the page's total height divided by 1.5, i.e. about two thirds of the way down, which lets us click the 'next' button on every page.

4: A little advice

Maybe you can describe how to find the path on a Mac before the Navigating part, which might help others work efficiently rather than spend time learning how to find the path.

That's all my feedback. You are welcome to discuss it or point out my mistakes. I'm afraid that a wrong understanding on my part could mislead other learners.

Chapter 18 - Python engineering/ data engineering

File structure for Python projects
Separate crawler, parser, analysis and presentation notebooks
Concept of reusability, especially reuse in loops with functions

Chapter 1 working thread

https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-01.md

Tick when resolved:

https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-01.md#how-to-open-terminal-on-mac . Add a screenshot to show spotlight search result.
https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-01.md#2-shell-commands . Add one screenshot of an entire Terminal to show multiple commands and their input/ output
https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-01.md#2-shell-commands . Use markdown "block code" notation to demo the input/ output of commands, instead of screenshots (images). This way looks better and also allows readers to copy and paste.
https://github.com/hupili/python-for-data-and-media-communication-gitbook/blob/master/notes-week-01.md#2-shell-commands Please explain the components of the interactions in the shell. That is, what is the $? What are the username and path before the $? Which part is the user's input and which is the shell's output? Think of the questions that first-time readers would ask. Use this screenshot, plus some annotations, to explain in detail.
Please turn those pointers into URLs. "(If it doesn't work, you can download some third party editors,such as sublime, visual studio code. You can edit .py file by these editors.)"
Rename the screenshot files to readable filename, i.e. a slug-style filename. Do not simply use the default "Screenshot ..." name. Or we will lose track in the future.
Markdown Lint

Use the following thread for discussion.

Collect challenging websites/ data sources to crawl

I realised that one difficulty many groups encountered last time was being unable to crawl some of the websites/ data sources they intended to. While it is impossible to enumerate all potential cases and barriers, I have decided to make more examples. @ChicoXYC please collect the crawling ideas from our past students that:

They want to crawl initially
They gave up in the end, due to unsolvable technical barriers

I will evaluate those ideas and make sample codes for those general issues.
