
dataset-examples's Introduction


Yelp's Academic Dataset Examples

We're providing several examples for use with the datasets available at http://www.yelp.com/dataset_challenge and http://www.yelp.com/academic_dataset. They all depend on mrjob and Python 2.6 or later.

To install all dependencies: $ pip install -e .

To test: $ tox

Samples

json_to_csv_converter: Convert the dataset from json format to csv format.

$ python json_to_csv_converter.py yelp_academic_dataset.json # Creates yelp_academic_dataset.csv

category_predictor: Given some text, predict likely categories. For example:

$ python category_predictor/category_predictor.py yelp_academic_dataset.json > category_predictor.json
$ python category_predictor/predict.py category_predictor.json "bacon donut"
Category: "Food" - 82.66% chance
Category: "Restaurants" - 16.99% chance
Category: "Donuts" - 0.12% chance
Category: "Basque" - 0.02% chance
Category: "Spanish" - 0.02% chance

review_autopilot: Use a Markov chain to finish a review. For example:

$ python review_autopilot/generate.py Food 'They have the best'
They have the best coffee is good food was delicious cookies and
a few friends i think they make this

positive_category_words: See the Yelp engineering blog for details about this example. In short, it generates positivity scores for words either globally or per-category.
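The blog post isn't reproduced here, but the core idea can be sketched with a hypothetical `word_positivity` helper: score each word by how far the average star rating of reviews containing it sits from the global average (assuming review dicts with 'stars' and 'text' fields, as in the dataset). This is a simplified illustration, not the exact method from the blog post:

```python
from collections import defaultdict

def word_positivity(reviews):
    """Score each word by the average star rating of the reviews containing
    it, minus the global average rating (positive = above-average word)."""
    totals = defaultdict(lambda: [0.0, 0])  # word -> [star sum, review count]
    star_sum = 0.0
    for review in reviews:
        star_sum += review['stars']
        for word in set(review['text'].lower().split()):
            totals[word][0] += review['stars']
            totals[word][1] += 1
    global_avg = star_sum / len(reviews)
    return {word: s / c - global_avg for word, (s, c) in totals.items()}

scores = word_positivity([
    {'stars': 5, 'text': 'great tacos'},
    {'stars': 1, 'text': 'terrible tacos'},
])
```

The real example runs as a mrjob MapReduce over the full dataset; the per-category variant presumably just keys words by (category, word) instead.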

Basic set-up

You can use any of mrjob's runners with these examples, but we'll focus on the local and EMR runners (if you have access to your own Hadoop cluster, check out the mrjob docs for instructions on setting that up).

Local mode couldn't be easier:

# this step will take a VERY long time
python review_autopilot/autopilot.py yelp_academic_dataset.json > autopilot.json

# this should be instant
python review_autopilot/generate.py Food 'They have the best'
> hot dogs ever

Waiting a long time is kind of lame, no? Let's try the same thing using EMR.

First off, you'll need an aws_access_key_id and an aws_secret_access_key. You can get these from the AWS console (you'll need to sign up for an AWS developer account and enable S3 / EMR usage, if you haven't already).

Create a simple mrjob.conf file, like this:

runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY
    aws_secret_access_key: YOUR_SECRET_KEY

Now that that's done, you can run the autopilot script on EMR.

# WARNING: this will cost you roughly $2 and take 10-20 minutes
python review_autopilot/autopilot.py --num-ec2-instances 10 --ec2-instance-type c1.medium -v --runner emr yelp_academic_dataset.json

You can save money (and time) by re-using jobflows and uploading the dataset to a personal, private s3 bucket - check out the mrjob docs for instructions on doing this.
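Job-flow reuse and a personal bucket can both be configured in the same mrjob.conf. A hedged sketch (the option names below come from older mrjob releases and the bucket name is a placeholder; check the mrjob docs for the exact spelling in your version):

runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY
    aws_secret_access_key: YOUR_SECRET_KEY
    s3_scratch_uri: s3://your-private-bucket/tmp/
    pool_emr_job_flows: true

Once the dataset is uploaded to the bucket, you can pass the s3:// URI as the input path instead of the local file.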

dataset-examples's People

Contributors

bllchmbrs, sc932


dataset-examples's Issues

yelp_db.sql - Table "friend" should have two foreign keys

Hi, I loaded the Yelp dataset into MySQL and inferred the schema from it.
According to my understanding of the schema, the table "friend" should be a join table joining a user to another friend (user).
So I expected two foreign keys: one from 'friend.user_id' pointing to 'user.id', and a second from 'friend.friend_id' pointing to 'user.id' as well.

CONSTRAINT `fk_friends_user1` FOREIGN KEY (`user_id`) REFERENCES `user` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION,  
CONSTRAINT `fk_friends_user2` FOREIGN KEY (`friend_id`) REFERENCES `user` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION

Does this make sense?

Thanks a lot.

Convert JSON to CSV: Missing module source and undefined variable

I get the following errors when trying to convert the JSON files to CSV. I am new to programming so I'd appreciate assistance resolving the issue.

Line 10: import simplejson as json
Import "simplejson" could not be resolved from source Pylance(reportMissingModuleSource) [Ln 10, Col 8]

Line 96: if isinstance(line_value, unicode):
"unicode" is not defined Pylance(reportUndefinedVariable) [Ln 96, Col 35]

def get_row(line_contents, column_names):
    """Return a csv compatible row given column names and a dict."""
    row = []
    for column_name in column_names:
        line_value = get_nested_value(
                        line_contents,
                        column_name,
                        )
        if isinstance(line_value, unicode):
            row.append('{0}'.format(line_value.encode('utf-8')))
        elif line_value is not None:
            row.append('{0}'.format(line_value))
        else:
            row.append('')
    return row
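The second error means the script is Python-2 code being run under Python 3, where the `unicode` builtin no longer exists. One minimal fix (the `text_type` and `to_csv_value` names are made up here for illustration) is a small shim that works under both versions:

```python
try:
    text_type = unicode  # Python 2: distinct unicode type
except NameError:
    text_type = str      # Python 3: every str is already unicode

def to_csv_value(line_value):
    # Same branching as the snippet above, without the Python-2-only .encode()
    if isinstance(line_value, text_type):
        return line_value
    elif line_value is not None:
        return '{0}'.format(line_value)
    else:
        return ''
```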

JSON_to_CSV_Converter

Hi, I am trying to convert the JSON files into CSV using the code you have provided, but I am running into some errors. When I run $ python json_to_csv_converter.py yelp_academic_dataset.json on the command line, I get the following error:

Traceback (most recent call last):
  File "json_to_csv_converter.py", line 122, in <module>
    column_names = get_superset_of_column_names_from_file(json_file)
  File "json_to_csv_converter.py", line 25, in get_superset_of_column_names_from_file
    for line in fin:
  File "C:\Users\Bengi\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input, self.errors, decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1102: character maps to <undefined>

Can you help me please?
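The traceback shows Python falling back to the Windows default codec (cp1252) while the dataset is UTF-8 encoded. Opening the file with an explicit encoding should fix it; `iter_json_lines` below is just an illustrative name:

```python
import io

def iter_json_lines(json_file_path):
    """Yield decoded lines; pass the encoding explicitly so the Windows
    default (cp1252) is never used on this UTF-8 dataset."""
    with io.open(json_file_path, encoding='utf-8') as fin:
        for line in fin:
            yield line
```

io.open works on Python 2 and 3; on Python 3, the builtin open(path, encoding='utf-8') is equivalent.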

Question about data availability for businesses in NYC

According to the description of the dataset:

This set includes information about local businesses in 10 metropolitan areas across 2 countries.

I'm using the dataset for a study focusing on health care businesses in New York City, and I need the information for all health-care-related businesses and their reviews there. Right after converting the .json file to .csv, without any further cleaning, I explored the dataset and found only one row with the city "New York" and 22 rows with the state "NY". So I wonder whether this is a mistake on my end, or whether the original dataset is in fact incomplete for NYC.

In addition, I also checked the Fusion API, but there are limits on both businesses and reviews:

This endpoint returns up to 1000 businesses based on the provided search criteria.

This endpoint returns up to three review excerpts for a given business ordered by Yelp's default sort order.

Is there any way I can access all the health care businesses in NYC and their reviews?
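For what it's worth, here is one way to scan business.json for matching records; the field names 'state' and 'categories' come from the dataset documentation, while the function name and defaults are illustrative. Note that if the metro areas in the dump don't include NYC, the scan will legitimately come back nearly empty:

```python
import json

def find_businesses(path, state='NY', category='Health & Medical'):
    """Scan a line-delimited business file for records matching a state and category."""
    matches = []
    with open(path, encoding='utf-8') as fin:
        for line in fin:
            business = json.loads(line)
            categories = business.get('categories') or []
            if isinstance(categories, str):  # newer dumps store a comma-separated string
                categories = [c.strip() for c in categories.split(',')]
            if business.get('state') == state and category in categories:
                matches.append(business)
    return matches
```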

Review_count

Hi fellows.
When I extract the 'review.json' file, it does not contain all of the reviews counted in the "review_count" variable. Is there any way to fix this? For example, if a user has 24 reviews, some of those reviews are missing from the file.

Including not recommended reviews

I'm highly interested in collecting non-recommended Yelp reviews.
In the current version of the dataset, only recommended reviews are included.
Can you share your not recommended reviews as well? or some way of getting them?
I appreciate any help you can provide.

Some columns are blank

So after successfully converting json to csv, when I open it up, most columns are blank.

I'm using Python 2.7.6 and OS X Yosemite. Any ideas why this is happening?

Thanks!

City has incorrect value

This record has a city value that looks like an address:

{
"attributes": {
"Accepts Credit Cards": true ,
"Parking": {
"garage": false ,
"lot": true ,
"street": false ,
"valet": false ,
"validated": false
} ,
"Price Range": 2 ,
"Wheelchair Accessible": true
} ,
"business_id":  "PMMoI3CzzIH0opgW8Qzp-A" ,
"categories": [
"Local Services" ,
"Self Storage"
] ,
"city":  "1023 E Frye Rd" ,
"full_address":  "Phoenix
1023 E Frye Rd, AZ 85048" ,
"hours": {
"Friday": {
"close":  "17:00" ,
"open":  "08:00"
} ,
"Monday": {
"close":  "17:00" ,
"open":  "08:00"
} ,
"Saturday": {
"close":  "17:00" ,
"open":  "08:00"
} ,
"Sunday": {
"close":  "17:00" ,
"open":  "08:00"
} ,
"Thursday": {
"close":  "17:00" ,
"open":  "08:00"
} ,
"Tuesday": {
"close":  "17:00" ,
"open":  "08:00"
} ,
"Wednesday": {
"close":  "17:00" ,
"open":  "08:00"
}
} ,
"id":  "0b6337cb-eb62-4f00-a5b0-92b2baca0a45" ,
"latitude": 33.2978479771866 ,
"longitude": -112.061276435852 ,
"name":  "Ahwatukee Foothills Storage" ,
"neighborhoods": [ ],
"open": true ,
"review_count": 3 ,
"stars": 2.5 ,
"state":  "AZ" ,
"type":  "business"
}

Can not import the dataset into python

    with open('yelp_dataset_challenge_academic_dataset', encoding='utf-8') as f:
        jsondata = json.load(f)

I tried to import the dataset into Python with the code above, but failed. The error is that the 'utf-8' codec can't decode byte 0xb5. I also tried encoding='charmap', but it didn't work either. Can anyone tell me how to import the data?
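Two things may be going on here: the dataset files are newline-delimited JSON (one object per line), so json.load on the whole file fails even with the right encoding; and a decode error on a byte like 0xb5 often means the file being read is the .tar archive itself rather than an extracted .json file. A sketch that parses line by line:

```python
import json

def load_records(path):
    """Parse a newline-delimited JSON file: one json.loads call per line."""
    records = []
    with open(path, encoding='utf-8') as fin:
        for line in fin:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```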

yelp_db.sql has errors when trying to import into MySQL

mysql -u root -p yelp_db < yelp_db.sql
aborting with the error:

ERROR 1064 (42000) at line 881: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''PvjfVRC_OlmOlIu' at line 1

Quick question about data

Hi,
I am a little confused about the dataset. The dataset I downloaded, which Yelp provides, consists of 5 different json files (yelp_academic_dataset_business.json, yelp_academic_dataset_checkin.json, yelp_academic_dataset_review.json, yelp_academic_dataset_tip.json, yelp_academic_dataset_user.json).

However, the examples in this repo use a yelp_academic_dataset.json which I do not have, e.g.:

python category_predictor/category_predictor.py yelp_academic_dataset.json > category_predictor.json

By the way, people mention 2 datasets (yelp.com/dataset_challenge and yelp.com/academic_dataset) but they lead to the same page!

Yelp csv file is not clean.

The "categories" column in the dataset is split across subsequent columns. A simple query to view the "state" column returns chunks of the previous column's strings. The reason may be the commas used in many fields. Any workaround?
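The converter in this repo writes with csv.writer, which already wraps comma-containing fields (like categories) in quotes; columns only shatter if the file is later split on raw commas. Reading with csv.reader (or forcing QUOTE_ALL when writing) keeps the fields intact. A sketch with hypothetical file paths:

```python
import csv
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), 'businesses.csv')  # illustrative path
with open(out_path, 'w', newline='') as fout:
    writer = csv.writer(fout, quoting=csv.QUOTE_ALL)  # quote every field
    writer.writerow(['business_id', 'categories', 'state'])
    writer.writerow(['PMMoI3CzzIH0opgW8Qzp-A', 'Local Services, Self Storage', 'AZ'])

with open(out_path, newline='') as fin:
    rows = list(csv.reader(fin))  # csv.reader honors the quoting on the way back in
```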

Citation issue

I would like to know if there is a way to properly cite the usage of this dataset. Thank you!

Survivorship

Does the data also include restaurants/firms that have closed down, i.e. not just the firms that survived up until publishing?

  • I will also post the answer on Kaggle, where I have similarly asked the question.

Regards & Thanks

Is the first entry in the variable “categories” the superordinate category?

Hi everybody,
Yelp uses a category system to group the different businesses. One of the superordinate categories is "Restaurants". My question is whether, in the dataset, the first entry in the variable "categories" is always the superordinate category.

How was the variable "categories" made?
What does the data structure of the CSV files look like after running the JSON converter?

I work with the CSV files provided by Yelp on Kaggle: https://www.kaggle.com/yelp-dataset/yelp-dataset/version/6?#yelp_review.csv

json_to_csv_converter.py fixed for Python 3

"""Convert the Yelp Dataset Challenge dataset from json format to csv.

For more information on the Yelp Dataset Challenge please visit http://yelp.com/dataset_challenge

"""
import argparse
import collections
import csv
import simplejson as json


def read_and_write_file(json_file_path, csv_file_path, column_names):
    """Read in the json dataset file and write it out to a csv file, given the column names."""
    with open(csv_file_path, 'w', newline='') as fout:
        csv_file = csv.writer(fout)
        csv_file.writerow(list(column_names))
        with open(json_file_path) as fin:
            for line in fin:
                line_contents = json.loads(line)
                #print(column_names, line_contents)
                csv_file.writerow(get_row(line_contents, column_names))

def get_superset_of_column_names_from_file(json_file_path):
    """Read in the json dataset file and return the superset of column names."""
    column_names = set()
    with open(json_file_path) as fin:
        for line in fin:
            line_contents = json.loads(line)
            column_names.update(
                    set(get_column_names(line_contents).keys())
                    )
    return column_names

def get_column_names(line_contents, parent_key=''):
    """Return a list of flattened key names given a dict.

    Example:

        line_contents = {
            'a': {
                'b': 2,
                'c': 3,
                },
        }

        will return: ['a.b', 'a.c']

    These will be the column names for the eventual csv file.

    """
    column_names = []
    for k, v in line_contents.items():
        column_name = "{0}.{1}".format(parent_key, k) if parent_key else k
        if isinstance(v, dict):  # collections.MutableMapping was removed in Python 3.10; json.loads only produces plain dicts
            column_names.extend(
                    get_column_names(v, column_name).items()
                    )
        else:
            column_names.append((column_name, v))
    return dict(column_names)

def get_nested_value(d, key):
    """Return a dictionary item given a dictionary `d` and a flattened key from `get_column_names`.
    
    Example:

        d = {
            'a': {
                'b': 2,
                'c': 3,
                },
        }
        key = 'a.b'

        will return: 2
    
    """
    if '.' not in key:
        if key not in d:
            return None
        return d[key]
    base_key, sub_key = key.split('.', 1)
    if base_key not in d:
        return None
    sub_dict = d[base_key]
    if sub_dict is None:
        return None
    return get_nested_value(sub_dict, sub_key)

def get_row(line_contents, column_names):
    """Return a csv compatible row given column names and a dict."""
    row = []
    for column_name in column_names:
        line_value = get_nested_value(
                        line_contents,
                        column_name,
                        )
        # print (line_value)
        if isinstance(line_value, str):
            row.append(line_value)
        elif line_value is not None:
            row.append(line_value)
        else:
            row.append('')
    return row

if __name__ == '__main__':
    """Convert a yelp dataset file from json to csv."""

    parser = argparse.ArgumentParser(
            description='Convert Yelp Dataset Challenge data from JSON format to CSV.',
            )

    parser.add_argument(
            'json_file',
            type=str,
            help='The json file to convert.',
            )

    args = parser.parse_args()

    json_file = args.json_file
    csv_file = '{0}.csv'.format(json_file.split('.json')[0])

    column_names = get_superset_of_column_names_from_file(json_file)
    read_and_write_file(json_file, csv_file, column_names)

More/Fewer total reviews per business than in 'review_count'

When merging the review and business datasets, I've noticed that the 'review_count' field in the business-level data is often different from the actual number of reviews present for each business. Additionally, some businesses have only 2 total reviews included in the data. Use the 'review_count' variable with caution.
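A quick way to quantify this is to count the review records per business and compare against the stated field. A sketch (the function takes iterables of raw JSON lines, so it works on open file handles too; the function name is illustrative):

```python
import json
from collections import Counter

def review_count_mismatches(business_lines, review_lines):
    """Compare each business's stated review_count with the number of
    review records actually present for that business."""
    actual = Counter(json.loads(line)['business_id'] for line in review_lines)
    mismatches = {}
    for line in business_lines:
        business = json.loads(line)
        bid = business['business_id']
        if business['review_count'] != actual[bid]:
            mismatches[bid] = (business['review_count'], actual[bid])
    return mismatches
```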

Question regarding the time zones in the review.json file

I have a question. I'm currently writing my master's thesis on the relationship between the time of posting an online review and the associated rating, length, and emotional tone of the review.

In order to do a proper data analysis, I want to use the Yelp Academic Dataset (2020), containing the variables documented at https://www.yelp.com/dataset/documentation/main. The variable 'date' obviously carries the date, but apparently also the time the review was written (this is not specifically mentioned), since downloaded values have the format "2016-08-29 00:41:13".

My main question: in which time zone did Yelp record (or convert) the DateTime values of the 'date' variable in review.json? Since I need to convert all timestamps back to the original location, it is quite important that I get this right. The Yelp API documentation does specify a time zone: "The time that the review was created in PST." However, the dataset documentation linked above does not, which is why I'm asking; Yelp may have converted the DateTime notation of the academic dataset into another format, so I want to validate this piece of information. Could someone please help me out? Thank you very much in advance.

Link to the data in 2020: https://www.yelp.com/dataset

input protocol code error for autopilot

a patch for autopilot.py -- do you want me to make this a pull request?

    - from mrjob.protocol import JSONProtocol
    + from mrjob.protocol import JSONValueProtocol

    - DEFAULT_INPUT_PROTOCOL = 'json_value'
    + INPUT_PROTOCOL = JSONValueProtocol

Check-in data is not user-related?

Hi there, I have a question about the check-in data of the Yelp dataset. I want a user's check-in history, but it seems that the check-in data has no mapping relationship to the user data?

Extract review.json file from yelp_academic_dataset.json

Hi,

I have a question about extracting the review.json from yelp_academic_dataset.json. I just execute the following:

python review_autopilot/autopilot.py yelp_academic_dataset.json > autopilot.json

Is autopilot.json equivalent to review.json or not?

Thanks

simplejson module not found

Hey, I came across your codebase, and I'm interested in using your json_to_csv_converter.py file to convert the .json files to .csv files from the yelp data challenge. However, when I try to run it, it can't locate where the simplejson module is. I was just wondering where that file was or how I could find it.

Thanks!
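simplejson is a third-party package, installable with pip install simplejson. Since its interface matches the stdlib json module for everything this script does, an import fallback also works:

```python
try:
    import simplejson as json  # C-accelerated; install with: pip install simplejson
except ImportError:
    import json  # stdlib fallback with the same loads/dumps interface
```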

Categories / Attributes / Hours are still nested

Hi all,

I'm currently having some issues with the JSON-to-CSV conversion. My main objective is to create a flat file where hours, categories, and attributes are completely flat (and looking at the code, that seems to be the intent of the script). However, running the code as provided, the categories, attributes, and hours are still nested, as opposed to getting a new column per category / a binary per row.

Here's a summary of output:

(screenshot of the output omitted)

With respect to the script, I'm using exactly what's given here (except hard coded file path instead of taking it as an input).

Any advice would be greatly appreciated!
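One thing worth checking: in newer dumps, attributes and hours can arrive as JSON-encoded strings rather than dicts, in which case the converter's isinstance-dict test never fires and nothing gets flattened; those fields would need to be decoded first. Assuming the values really are dicts, the flattening itself is just a recursion like this (`flatten` is a hypothetical helper mirroring the script's dotted-column scheme):

```python
def flatten(record, parent_key=''):
    """Recursively turn nested dicts into dotted column names, e.g.
    {'hours': {'Monday': {'open': '08:00'}}} -> {'hours.Monday.open': '08:00'}."""
    flat = {}
    for key, value in record.items():
        column = '{0}.{1}'.format(parent_key, key) if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column))
        else:
            flat[column] = value
    return flat
```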

Get Yelp original json file

How can I get the original json datasets?
The file "yelp_dataset.tar" I downloaded from Yelp is a plain tar file, not the tar.gz file described, and there is no json file after uncompressing it.
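Python's tarfile module opens both .tar and .tar.gz regardless of the file extension, so it's an easy way to check what the download actually contains (the function name and destination below are illustrative):

```python
import tarfile

def extract_dataset(archive_path, dest='yelp_dataset'):
    """List and extract the archive; tarfile auto-detects plain vs. gzipped tar."""
    with tarfile.open(archive_path) as archive:
        names = archive.getnames()  # member names, e.g. the *.json files
        archive.extractall(dest)
    return names
```

If the listing shows another .tar nested inside, extract that inner archive the same way.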

'NoneType' object is not subscriptable error when converting yelp_academic_dataset_business.json

I've converted all the datasets in yelp_dataset except yelp_academic_dataset_business.json.
It returns an error:

Traceback (most recent call last):
  File "json_to_csv_converter2.py", line 129, in <module>
    read_and_write_file(json_file, csv_file, column_names)
  File "json_to_csv_converter2.py", line 21, in read_and_write_file
    csv_file.writerow(get_row(line_contents, column_names))
  File "json_to_csv_converter2.py", line 97, in get_row
    column_name,
  File "json_to_csv_converter2.py", line 89, in get_nested_value
    return get_nested_value(sub_dict, sub_key)
  File "json_to_csv_converter2.py", line 84, in get_nested_value
    return d[key]
TypeError: 'NoneType' object is not subscriptable

I've changed the script from:

if '.' not in key:
        if key not in d:
            return None
        return d[key]

to

if '.' not in key:
        if d is None:
            return None
        elif key not in d:
            return None
        return d[key]

and the problem was solved. But I don't think that's the right way...
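That guard is reasonable; an equivalent version puts the None checks at the top of the function, so that every level of nesting (including a null mid-level value like "hours": null) is covered by the same branch:

```python
def get_nested_value(d, key):
    """Return d's value for a dotted key, or None when any level of nesting
    is missing or explicitly null (as with "hours": null in business.json)."""
    if '.' not in key:
        if d is None or key not in d:
            return None
        return d[key]
    base_key, sub_key = key.split('.', 1)
    if d is None or base_key not in d:
        return None
    return get_nested_value(d[base_key], sub_key)
```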

script error while running json to csv converter

Hi,

I am trying to run the json to csv converter script but am getting the following error. May I know what I am doing wrong?

Traceback (most recent call last):
  File "json_to_csv_converter.py", line 108, in <module>
    parser = optparse.ArgumentParser(
AttributeError: 'module' object has no attribute 'ArgumentParser'

json_to_csv_converter.py has a small issue for the new yelp dataset

For the file business.json, there are records (for example the 4th record) where a key such as hours exists but its value is None. This causes an exception in the function get_nested_value.

To fix it, I suggest adding the following to that function just before the last line (return get_nested_value(sub_dict, sub_key)):

    if sub_dict is None:
        return None

Problem running category_predictor.py

Hi Everyone,
I am trying to run category_predictor.py but I get the error "string indices must be integers, not str". I run it with the command "category_predictor.py C:\Python27\yelpsample.txt". Is the error in the category_predictor.py code? In my command-line entry? In the .txt file I submit? The sample pulls strings directly from the Yelp academic dataset. Any assistance is appreciated.

Issue with JSON to CSV converter

hi,

I got the following issue while using the python converter to convert the json to csv format:

Traceback (most recent call last):
  File "json_to_csv_converter.py", line 122, in <module>
    column_names = get_superset_of_column_names_from_file(json_file)
  File "json_to_csv_converter.py", line 28, in get_superset_of_column_names_from_file
    line_contents = json.loads(line)
  File "C:\Users\rbaral\Anaconda2\lib\site-packages\simplejson\__init__.py", line 516, in loads
    return _default_decoder.decode(s)
  File "C:\Users\rbaral\Anaconda2\lib\site-packages\simplejson\decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "C:\Users\rbaral\Anaconda2\lib\site-packages\simplejson\decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Thanks.
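"Expecting value: line 1 column 1 (char 0)" means the very first line isn't JSON: commonly an empty first line, a UTF-8 BOM, or a file that isn't one of the line-delimited dataset files at all. A defensive parse loop, as a sketch:

```python
import json

def parse_lines(fin):
    """Skip blank lines and strip a UTF-8 BOM before decoding each line."""
    for line in fin:
        line = line.lstrip('\ufeff').strip()
        if line:
            yield json.loads(line)
```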

Problem with running examples due to mrjob

I'm trying to run these examples on Windows and I get a vague error saying that something is wrong with invoking sort.

  File "F:/Yelp/dataset-examples-master/review_autopilot/autopilot.py", line 151, in <module>
    ReviewAutoPilot().run()
  File "C:\Python27\lib\site-packages\mrjob\job.py", line 494, in run
    mr_job.execute()
  File "C:\Python27\lib\site-packages\mrjob\job.py", line 512, in execute
    super(MRJob, self).execute()
  File "C:\Python27\lib\site-packages\mrjob\launch.py", line 147, in execute
    self.run_job()
  File "C:\Python27\lib\site-packages\mrjob\launch.py", line 208, in run_job
    runner.run()
  File "C:\Python27\lib\site-packages\mrjob\runner.py", line 458, in run
    self._run()
  File "C:\Python27\lib\site-packages\mrjob\sim.py", line 191, in _run
    self._invoke_sort(self._step_input_paths(), sort_output_path)
  File "C:\Python27\lib\site-packages\mrjob\runner.py", line 1225, in _invoke_sort
    proc.stdin.write(buf)
IOError: [Errno 22] Invalid argument

Question about dataset download

Where can I download the dataset for Yelp challenge round 8? I need this dataset to validate the experimental results of a paper.

CSV Converter for Python 3

The current version of the converter is not compatible with Python 3. Here is a fixed version:

# -*- coding: utf-8 -*-
"""Convert the Yelp Dataset Challenge dataset from json format to csv.

For more information on the Yelp Dataset Challenge please visit http://yelp.com/dataset_challenge

"""
import argparse
import collections.abc
import csv
import json

def read_and_write_file(json_file_path, csv_file_path, column_names):
    """Read in the json dataset file and write it out to a csv file, given the column names."""
    with open(csv_file_path, 'w+', newline='') as fout:
        csv_file = csv.writer(fout)
        csv_file.writerow(list(column_names))
        with open(json_file_path) as fin:
            for line in fin:
                line_contents = json.loads(line)
                csv_file.writerow(get_row(line_contents, column_names))

def get_superset_of_column_names_from_file(json_file_path):
    """Read in the json dataset file and return the superset of column names."""
    column_names = set()
    with open(json_file_path) as fin:
        for line in fin:
            line_contents = json.loads(line)
            column_names.update(
                    set(get_column_names(line_contents).keys())
                    )
    return column_names

def get_column_names(line_contents, parent_key=''):
    """Return a list of flattened key names given a dict.

    Example:

        line_contents = {
            'a': {
                'b': 2,
                'c': 3,
                },
        }

        will return: ['a.b', 'a.c']

    These will be the column names for the eventual csv file.

    """
    column_names = []
    for k, v in line_contents.items():
        column_name = "{0}.{1}".format(parent_key, k) if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            column_names.extend(
                    get_column_names(v, column_name).items()
                    )
        else:
            column_names.append((column_name, v))
    return dict(column_names)

def get_nested_value(d, key):
    """Return a dictionary item given a dictionary d and a flattened key from get_column_names.

    Example:

        d = {
            'a': {
                'b': 2,
                'c': 3,
                },
        }
        key = 'a.b'

        will return: 2

    """
    if '.' not in key:
        if key not in d:
            return None
        return d[key]
    base_key, sub_key = key.split('.', 1)
    if base_key not in d:
        return None
    sub_dict = d[base_key]
    if sub_dict is None:  # guard against null values, e.g. "hours": null
        return None
    return get_nested_value(sub_dict, sub_key)

def get_row(line_contents, column_names):
    """Return a csv compatible row given column names and a dict."""
    row = []
    for column_name in column_names:
        line_value = get_nested_value(
                line_contents,
                column_name,
                )
        if isinstance(line_value, str):
            row.append(line_value)  # str is already unicode on Python 3; no .encode() needed
        elif line_value is not None:
            row.append('{0}'.format(line_value))
        else:
            row.append('')
    return row

if __name__ == '__main__':
    """Convert a yelp dataset file from json to csv."""

    parser = argparse.ArgumentParser(
            description='Convert Yelp Dataset Challenge data from JSON format to CSV.',
            )

    parser.add_argument(
            'json_file',
            type=str,
            help='The json file to convert.',
            )

    args = parser.parse_args()

    json_file = args.json_file
    csv_file = '{0}.csv'.format(json_file.split('.json')[0])

    column_names = get_superset_of_column_names_from_file(json_file)
    read_and_write_file(json_file, csv_file, column_names)

Can I know which photo belongs to which review

Thank you for compiling the dataset. Currently the photo metadata looks like:

{
    // string, 22 character unique photo id
    "photo_id": "_nN_DhLXkfwEkwPNxne9hw",
    // string, 22 character business id, maps to business in business.json
    "business_id" : "tnhfDv5Il8EaGSXZGiuQGg",
    // string, the photo caption, if any
    "caption" : "carne asada fries",
    // string, the category the photo belongs to, if any
    "label" : "food"
}

I'd like to know if it is possible to get more metadata for the photo (e.g. user_id), so I know which user contributed the photo?

Error in setting up autopilot.json

When I run the command ..
yelp_academic_dataset.json > autopilot.json

I get the following error..

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
Traceback (most recent call last):
  File "review_autopilot/autopilot.py", line 143, in <module>
    ReviewAutoPilot().run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 461, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 479, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 153, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 216, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 470, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/sim.py", line 164, in _run
    _error_on_bad_paths(self.fs, self._input_paths)
  File "/usr/local/lib/python2.7/dist-packages/mrjob/sim.py", line 549, in _error_on_bad_paths
    "None found in %s" % paths)
ValueError: At least one valid path is required. None found in ['yelp_academic_dataset.json']

No idea how to resolve this. Any idea why I am facing this ?
