Australian Digital Observatory Application Programming Interface (ADO-API)
Version 1.0.0
Introduction
The Australian Digital Observatory Application Programming Interface (API) provides programmatic access to social media data and collections held in the Melbourne eResearch Group node of the project. The API is RESTful, encrypted, and requires authorization to access its services. It is also synchronous, hence responses to queries reflect the latest state of the data.
All in all, the functionality of the API can be broken down into five areas:
1. Aggregation
Aggregation over the number of documents in the collection can be performed at various aggregation levels. This includes by time period (day/month/year), seasonality, language, and by place (where the information is available).
2. Term frequency
Aggregation can be performed over a number of days to attain term frequency for specific terms by day.
3. Term Similarity
Term similarity can also be queried, for a given term on a certain day, based on word2vec models built on a daily basis.
4. Topic Modelling
Topic modelling is performed daily on the social media collections using BERTopic. The API can be queried to interact with these results, and to build and retrieve network graphs from consecutive days of clustering.
5. Text Search
A full-text search query that returns an array of social media post IDs. This feature allows selecting posts based on author ID, text, hashtags, date, and language (or a combination of these fields).
NOTE: All times are in UTC, hence every aggregation or selection by date and time may be off by about 10 hours for posters based on the eastern seaboard of Australia. Since there is no reliable way of knowing the location of the poster at the time of posting, UTC time is used in the API to minimize bias.
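As an illustration of the offset, the sketch below converts a UTC timestamp to Australian Eastern time using Python's standard `zoneinfo` module (Python 3.9+); the example date is arbitrary.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# A post timestamped late in the UTC day...
utc_time = datetime(2021, 7, 10, 22, 30, tzinfo=timezone.utc)

# ...falls on the next calendar day in Australian Eastern Standard Time (UTC+10).
local_time = utc_time.astimezone(ZoneInfo("Australia/Sydney"))

print(utc_time.date())    # 2021-07-10
print(local_time.date())  # 2021-07-11
```

Selecting posts by local date therefore requires shifting the requested UTC date range accordingly.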
Authorization and Authentication (A&A)
Each resource in the API requires authorization by supplying a JSON web token (JWT), and a user has to go through an authentication process to acquire the JWT.
The procedure to authenticate is based on JWT and requires the following steps to be executed:
- A user has to send a POST request to the `/login` endpoint, using the basic authentication scheme, with the API key that has been requested from the ADO Project. This request returns either an error or a JWT that contains the roles the API key holder is entitled to. Below is an example of how authentication is performed on the command line, where `<key string>` is the API key granted to the user:
export API_DEVELOPERUSER_KEY=<key string>
JWT=$(curl -XPOST -u "apikey:${API_DEVELOPERUSER_KEY}" https://api.ado.eresearch.unimelb.edu.au/login)
- The JWT is then passed in the `Authorization` header (as in `Bearer <JWT token>`) to all subsequent requests.
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/version'\
--header "Authorization: Bearer ${JWT}"; echo
The JWT is only valid for 24 hours, after which another JWT has to be requested to continue accessing the API.
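Because of this expiry, long-running clients usually re-authenticate automatically. Below is a minimal sketch of such a wrapper (the `AdoClient` class and its refresh margin are illustrative, not part of the API):

```python
import time

import requests
from requests.auth import HTTPBasicAuth

LOGIN_URL = "https://api.ado.eresearch.unimelb.edu.au/login"

class AdoClient:
    """Minimal client that re-requests a JWT when the old one nears expiry."""

    def __init__(self, api_key, ttl_seconds=24 * 3600):
        self.api_key = api_key
        self.ttl = ttl_seconds
        self.jwt = None
        self.issued_at = 0.0

    def _login(self):
        # Basic-auth login, as described above.
        res = requests.post(LOGIN_URL, auth=HTTPBasicAuth("apikey", self.api_key))
        res.raise_for_status()
        self.jwt = res.text
        self.issued_at = time.time()

    def get(self, url, **kwargs):
        # Refresh the token shortly before the 24-hour window closes.
        if self.jwt is None or time.time() - self.issued_at > self.ttl - 60:
            self._login()
        headers = kwargs.pop("headers", {})
        headers["Authorization"] = f"Bearer {self.jwt}"
        return requests.get(url, headers=headers, **kwargs)
```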
A&A Example in Python
Below is an example of how to attain a JWT in Python (version 3.6 and over, as f-strings are used), using the `requests` library for HTTP calls, where `<key string>` is to be replaced with the API key.
import requests
from requests.auth import HTTPBasicAuth
# the user passes their API key as a string
API_KEY = '<key string>'
url = "https://api.ado.eresearch.unimelb.edu.au/login"
res = requests.post(url, auth=HTTPBasicAuth('apikey', API_KEY))
if res.ok:
jwt = res.text
Below is an example of how to use the JWT with the `requests.get()` function, to attain the API version.
url = 'https://api.ado.eresearch.unimelb.edu.au/version'
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers)
version = res.text
Aggregation
The aggregation endpoint is designed to perform aggregations over the collections to retrieve count data or sentiment data. Aggregation can be performed by a function of time, or by a number of descriptive properties inherent to the social media document data. The requests are synchronous and reflect the state of the data in the database, in near real-time.
Summary of a collection
The number of posts in a collection and the start and end dates of harvesting can be retrieved using:
GET /analysis/aggregate/collections/{collection}/summary
Examples:
cURL (command line)
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/aggregate/collections/twitter/summary'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/aggregate/collections/twitter/summary'
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers)
result = res.json()
Sample Response
{"startDate":"2021-6-2", "endDate":"2023-3-22","count":2845093}
Aggregation by time
Day, month and year
Aggregation of count or sentiment can be performed to obtain the number of documents in a social media database by day, month or year as the aggregation level. The request type and endpoint for this function are:
GET /analysis/aggregate/collections/{collection}/aggregation
where ‘collection’ is a path parameter naming the target social media collection. The query parameters for the function are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
startDate | string (date-time) | Start date for data requested | Yes | 2021-09-16 |
endDate | string (date-time) | End date for data requested | Yes | 2021-09-21 |
aggregationLevel | string | One of day , month or year | Yes | day |
sentiment | boolean | Request sentiment statistics rather than count statistics | No | false |
The function defaults to a count request if `sentiment` is absent or `false`; otherwise it returns a `sentiment` property (the sum of the sentiments of individual posts) and `sentimentcount` instead of `count`, to avoid confusion.
Examples
Below are examples of how to aggregate twitter documents by day that have been published between `2021-07-01` and `2021-07-11`, with the output, in various programming languages.
cURL (command line)
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/aggregate/collections/twitter/aggregation?aggregationLevel=day&startDate=2021-07-01&endDate=2021-07-11'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/aggregate/collections/twitter/aggregation'
# query parameters set in the dict below
qs_params = { 'startDate' : '2021-07-01' , 'endDate':'2021-07-11', 'aggregationLevel': 'day' }
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample Response
Below is a sample response for an aggregation request with `startDate` as `2021-07-10` and `endDate` as `2021-07-15`, with `aggregationLevel` set to `day`.
[
{
"time": "2021-7-10",
"count": 285671
},
{
"time": "2021-7-11",
"count": 312456
},
{
"time": "2021-7-12",
"count": 294918
},
{
"time": "2021-7-13",
"count": 289360
},
{
"time": "2021-7-14",
"count": 305276
}
]
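Once parsed, the response is a plain list of objects, so summary statistics are straightforward to derive. The sketch below (hard-coding the sample response above as `result`) totals the counts and finds the busiest day:

```python
# `result` is the parsed JSON list returned by the aggregation endpoint above.
result = [
    {"time": "2021-7-10", "count": 285671},
    {"time": "2021-7-11", "count": 312456},
    {"time": "2021-7-12", "count": 294918},
    {"time": "2021-7-13", "count": 289360},
    {"time": "2021-7-14", "count": 305276},
]

total = sum(entry["count"] for entry in result)
peak = max(result, key=lambda entry: entry["count"])

print(total)         # 1487681
print(peak["time"])  # 2021-7-11
```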
Seasonality
Aggregation of count or sentiment can be performed to obtain a number of documents in a social media database by day of the week, as well as hour of the day within the day of the week, as the aggregation level. The request type and endpoint for this function are:
GET /analysis/aggregate/collections/{collection}/seasonality
where ‘collection’ is a path parameter naming the target social media collection. The query parameters for the function are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
startDate | string (date-time) | Start date for data requested | Yes | 2021-09-16 |
endDate | string (date-time) | End date for data requested | Yes | 2021-09-21 |
aggregationLevel | string | One of dayofweek or dayofweekhour | Yes | dayofweek |
sentiment | boolean | Request sentiment statistics rather than count statistics | No | false |
The function defaults to a count request if `sentiment` is absent or `false`; otherwise it returns the sentiment.
Examples
Below are examples of how to aggregate twitter documents by seasonality, set to the day of the week, that have been published between `2021-07-01` and `2021-07-11`, with the output, in various programming languages.
cURL (command line)
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/aggregate/collections/twitter/seasonality?aggregationLevel=dayofweek&startDate=2021-07-01&endDate=2021-07-11'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/aggregate/collections/twitter/seasonality'
# query parameters set in the dict below
qs_params = { 'startDate' : '2021-07-01' , 'endDate':'2021-07-11', 'aggregationLevel': 'dayofweek' }
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample Response
Below is a sample response for a seasonality request with `startDate` as `2021-07-01` and `endDate` as `2021-07-31`, with `aggregationLevel` set to `dayofweek`.
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/aggregate/collections/twitter/seasonality?aggregationLevel=dayofweek&startDate=2021-07-01&endDate=2021-07-31'\
--header "Authorization: Bearer ${JWT}"
[
{
"time": "friday",
"count": 1377965
},
{
"time": "monday",
"count": 1004482
},
{
"time": "tuesday",
"count": 922970
},
{
"time": "wednesday",
"count": 1220736
},
{
"time": "thursday",
"count": 1167768
},
{
"time": "saturday",
"count": 1240801
},
{
"time": "sunday",
"count": 937304
}
]
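Note the days of the week are not guaranteed to arrive in calendar order. A small sketch (hard-coding the sample response above) reorders them for plotting:

```python
# `result` is the parsed JSON from the seasonality request above; entries
# arrive in no particular order, so sort them into calendar order.
result = [
    {"time": "friday", "count": 1377965},
    {"time": "monday", "count": 1004482},
    {"time": "tuesday", "count": 922970},
    {"time": "wednesday", "count": 1220736},
    {"time": "thursday", "count": 1167768},
    {"time": "saturday", "count": 1240801},
    {"time": "sunday", "count": 937304},
]

WEEK = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
ordered = sorted(result, key=lambda entry: WEEK.index(entry["time"]))

print([entry["time"] for entry in ordered])
# ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
```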
Aggregation by descriptive property
A fraction of the social media data (~7%) contains some geographic information about the origin of the post. However, the specificity of this origin can span from just the country, down to suburbs or landmarks. Therefore, the place information of a tweet was standardized to the following levels:
{
countrycode: (*|string|null),
statecode: (*|null),
gccsacode: (*|null),
salcode: (*|null)
}
where each of the levels was inferred from the location data present in the document. NOTE: the original data was not modified by the above standardization; the documents were transformed separately in CouchDB MapReduce views, hence the original location data as received from the harvesters is still as it was received. If one of the levels can’t be inferred, it is left as `null`.
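Since the levels are strictly nested, a client often wants the most specific level that is available. A minimal sketch (the helper name is illustrative, not part of the API):

```python
def finest_place_level(place):
    """Return the name and code of the most specific non-null place level.

    `place` follows the standardized shape above; any level may be None
    (the JSON null). Helper name is hypothetical, not part of the API.
    """
    for level in ("salcode", "gccsacode", "statecode", "countrycode"):
        if place.get(level) is not None:
            return level, place[level]
    return None, None

# A post geolocated only down to a greater capital city area:
place = {"countrycode": "au", "statecode": "2", "gccsacode": "2gmel", "salcode": None}
print(finest_place_level(place))  # ('gccsacode', '2gmel')
```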
Social media documents also contain the `language` property that can be aggregated over to retrieve the count or sentiment by language over a time period.
Language and place
Aggregation of count or sentiment can be performed to obtain a number of documents by language, with further aggregation possible to obtain the count or sentiment of language by place. The request type and endpoint for this function are:
GET /analysis/language/collections/{collection}
where ‘collection’ is a path parameter naming the target social media collection. The query parameters for the function are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
startDate | string (date-time) | Start date for data requested | Yes | 2021-09-16 |
endDate | string (date-time) | End date for data requested | Yes | 2021-09-21 |
aggregationLevel | string | One of language , country , state , gccsa or suburb | Yes | language |
sentiment | boolean | Request sentiment statistics rather than count statistics | No | false |
Examples
Below are examples of how to aggregate twitter documents that have been published between `2021-07-01` and `2021-07-11` by language, with the output in various programming languages.
cURL
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/language/collections/twitter?aggregationLevel=language&startDate=2021-07-01&endDate=2021-07-11'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/language/collections/twitter'
# query parameters set in the dict below
qs_params = { 'startDate' : '2021-07-01' , 'endDate':'2021-07-11', 'aggregationLevel': 'language' }
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample Response
The language codes conform to the two-letter ISO 639-1 standard, with `und` denoting an undetermined language.
Aggregation Level: language
Below is an abridged sample response (NOTE: the actual response was much larger) for a language request with `startDate` as `2021-07-01` and `endDate` as `2021-07-11`, with `aggregationLevel` set to `language`:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/language/collections/twitter?aggregationLevel=language&startDate=2021-07-01&endDate=2021-07-11'\
--header "Authorization: Bearer ${JWT}"
[
{
"language": "is",
"count": 1
},
{
"language": "pa",
"count": 1
},
{
"language": "sd",
"count": 1
},
{
"language": "si",
"count": 1
},
{
"language": "te",
"count": 1
},
{
"language": "vi",
"count": 1
},
{
"language": "bg",
"count": 1
},
{
"language": "bn",
"count": 1
},
{
"language": "kn",
"count": 1
},
{
"language": "mr",
"count": 1
},
{
"language": "sd",
"count": 1
}
]
Aggregation Level: state
Below is an abridged sample response (NOTE: the actual response was much larger) for a language request with `startDate` as `2021-07-27` and `endDate` as `2021-07-31`, with `aggregationLevel` set to `state`:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/language/collections/twitter?aggregationLevel=state&startDate=2021-07-27&endDate=2021-07-31'\
--header "Authorization: Bearer ${JWT}"
[
{
"language": "ar",
"country": "au",
"state": "2",
"count": 25
},
{
"language": "ar",
"country": "au",
"state": "3",
"count": 22
},
{
"language": "ar",
"country": "au",
"state": "4",
"count": 3
},
{
"language": "ar",
"country": "au",
"state": "5",
"count": 19
},
{
"language": "ar",
"country": "au",
"state": "8",
"count": 2
},
{
"language": "ar",
"country": "sa",
"state": null,
"count": 2
}
]
Aggregation Level: gccsa
Below is an abridged sample response (NOTE: the actual response was much larger) for a language request with `startDate` as `2021-07-27` and `endDate` as `2021-07-31`, with `aggregationLevel` set to `gccsa`:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/language/collections/twitter?aggregationLevel=gccsa&startDate=2021-07-27&endDate=2021-07-31'\
--header "Authorization: Bearer ${JWT}"
[
{
"language": "und",
"country": "au",
"state": "7",
"gccsa": "7rnte",
"count": 3
},
{
"language": "und",
"country": "au",
"state": "8",
"gccsa": "8acte",
"count": 29
},
{
"language": "und",
"country": "eg",
"state": null,
"gccsa": null,
"count": 1
},
{
"language": "und",
"country": "in",
"state": null,
"gccsa": null,
"count": 6
},
{
"language": "und",
"country": "pk",
"state": null,
"gccsa": null,
"count": 3
},
{
"language": "und",
"country": "th",
"state": null,
"gccsa": null,
"count": 1
},
{
"language": "und",
"country": "us",
"state": null,
"gccsa": null,
"count": 7
},
{
"language": "ur",
"country": null,
"state": null,
"gccsa": null,
"count": 136
},
{
"language": "ur",
"country": "au",
"state": "1",
"gccsa": "1gsyd",
"count": 1
},
{
"language": "ur",
"country": "au",
"state": "2",
"gccsa": "2gmel",
"count": 2
},
{
"language": "ur",
"country": "au",
"state": "3",
"gccsa": "3gbri",
"count": 3
},
{
"language": "ur",
"country": "au",
"state": "8",
"gccsa": "8acte",
"count": 1
},
{
"language": "vi",
"country": null,
"state": null,
"gccsa": null,
"count": 23
},
{
"language": "vi",
"country": "au",
"state": "1",
"gccsa": "1gsyd",
"count": 1
},
{
"language": "vi",
"country": "au",
"state": "4",
"gccsa": "4gade",
"count": 2
},
{
"language": "zh",
"country": null,
"state": null,
"gccsa": null,
"count": 1420
},
{
"language": "zh",
"country": "au",
"state": "1",
"gccsa": "1gsyd",
"count": 19
},
{
"language": "zh",
"country": "au",
"state": "1",
"gccsa": "1rnsw",
"count": 5
},
{
"language": "zh",
"country": "au",
"state": "2",
"gccsa": null,
"count": 1
},
{
"language": "zh",
"country": "au",
"state": "2",
"gccsa": "2gmel",
"count": 26
},
{
"language": "zh",
"country": "au",
"state": "3",
"gccsa": "3gbri",
"count": 6
},
{
"language": "zh",
"country": "au",
"state": "5",
"gccsa": "5gper",
"count": 27
},
{
"language": "zh",
"country": "au",
"state": "8",
"gccsa": "8acte",
"count": 1
}
]
Aggregation Level: suburb
Below is an abridged sample response (NOTE: the actual response was much larger) for a language request with `startDate` as `2021-07-27` and `endDate` as `2021-07-31`, with `aggregationLevel` set to `suburb`:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/language/collections/twitter?aggregationLevel=suburb&startDate=2021-07-27&endDate=2021-07-31'\
--header "Authorization: Bearer ${JWT}"
[
{
"language": "und",
"country": "au",
"state": "5",
"gccsa": "5rwau",
"suburb": "50228",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "5",
"gccsa": "5rwau",
"suburb": "50492",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "5",
"gccsa": "5rwau",
"suburb": "50536",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "5",
"gccsa": "5rwau",
"suburb": "50602",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "5",
"gccsa": "5rwau",
"suburb": "51281",
"count": 3
},
{
"language": "und",
"country": "au",
"state": "6",
"gccsa": null,
"suburb": null,
"count": 5
},
{
"language": "und",
"country": "au",
"state": "6",
"gccsa": "6ghob",
"suburb": "60051",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "6",
"gccsa": "6ghob",
"suburb": "60276",
"count": 20
},
{
"language": "und",
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60156",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60253",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60322",
"count": 3
},
{
"language": "und",
"country": "au",
"state": "7",
"gccsa": null,
"suburb": null,
"count": 1
},
{
"language": "und",
"country": "au",
"state": "7",
"gccsa": "7gdar",
"suburb": "70073",
"count": 11
},
{
"language": "und",
"country": "au",
"state": "7",
"gccsa": "7rnte",
"suburb": "70005",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "7",
"gccsa": "7rnte",
"suburb": "70133",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "7",
"gccsa": "7rnte",
"suburb": "70241",
"count": 1
},
{
"language": "und",
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"count": 29
}
]
Place and Language
Aggregation of count or sentiment can be performed to obtain a number of documents by place, with further aggregation possible to obtain the count or sentiment of place by language. The request type and endpoint for this function are:
GET /analysis/place/collections/{collection}
where ‘collection’ is a path parameter naming the target social media collection. The query parameters for the function are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
startDate | string (date-time) | Start date for data requested | Yes | 2021-09-16 |
endDate | string (date-time) | End date for data requested | Yes | 2021-09-21 |
aggregationLevel | string | One of country , state , gccsa , suburb or language | Yes | language |
sentiment | boolean | Request sentiment statistics rather than count statistics | No | false |
The places are represented by codes according to the following standards:
Place level | Standard |
---|---|
Country | ISO-3166 two-letter alpha code |
State | Australian Statistical Geography Standard – States and Territories |
Greater Capital City Statistical Area | Australian Statistical Geography Standard – GCCSA |
Suburb or locality | Australian Statistical Geography Standard – Suburbs and Localities |
Examples
Below are examples of how to aggregate twitter documents that have been published between `2021-07-01` and `2021-07-11` by place, with the output in various programming languages.
cURL
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/place/collections/twitter?aggregationLevel=country&startDate=2021-07-01&endDate=2021-07-11'\
--header "Authorization: Bearer ${JWT}"; echo
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/place/collections/twitter'
# query parameters set in the dict below
qs_params = { 'startDate' : '2021-07-01' , 'endDate':'2021-07-11', 'aggregationLevel': 'country' }
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample Response
Aggregation Level: state
Below is an abridged sample response (NOTE: the actual response was much larger) for a place request with `startDate` as `2021-07-26` and `endDate` as `2021-07-31`, with `aggregationLevel` set to `state`:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/place/collections/twitter?aggregationLevel=state&startDate=2021-07-26&endDate=2021-07-31'\
--header "Authorization: Bearer ${JWT}"
[
{
"country": "au",
"state": "1",
"count": 6693
},
{
"country": "au",
"state": "2",
"count": 6721
},
{
"country": "au",
"state": "3",
"count": 3228
},
{
"country": "au",
"state": "4",
"count": 1389
},
{
"country": "au",
"state": "5",
"count": 1601
},
{
"country": "au",
"state": "6",
"count": 445
},
{
"country": "au",
"state": "7",
"count": 202
},
{
"country": "au",
"state": "8",
"count": 457
}
]
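The numeric state codes in responses like the one above follow the ASGS states-and-territories numbering. The mapping below reflects the standard ASGS assignment, but should be verified against the ABS documentation before use:

```python
# ASGS state/territory codes (confirm against the ABS ASGS documentation).
ASGS_STATES = {
    "1": "New South Wales",
    "2": "Victoria",
    "3": "Queensland",
    "4": "South Australia",
    "5": "Western Australia",
    "6": "Tasmania",
    "7": "Northern Territory",
    "8": "Australian Capital Territory",
    "9": "Other Territories",
}

# Two entries from the sample response above.
result = [
    {"country": "au", "state": "1", "count": 6693},
    {"country": "au", "state": "2", "count": 6721},
]

for entry in result:
    print(ASGS_STATES.get(entry["state"], "unknown"), entry["count"])
# New South Wales 6693
# Victoria 6721
```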
Aggregation Level: gccsa
Below is an abridged sample response (NOTE: the actual response was much larger) for a place request with `startDate` as `2021-07-26` and `endDate` as `2021-07-31`, with `aggregationLevel` set to `gccsa`:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/place/collections/twitter?aggregationLevel=gccsa&startDate=2021-07-26&endDate=2021-07-31'\
--header "Authorization: Bearer ${JWT}"
[
{
"country": "au",
"state": "5",
"gccsa": "5rwau",
"count": 184
},
{
"country": "au",
"state": "6",
"gccsa": null,
"count": 109
},
{
"country": "au",
"state": "6",
"gccsa": "6ghob",
"count": 195
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"count": 141
},
{
"country": "au",
"state": "7",
"gccsa": null,
"count": 20
},
{
"country": "au",
"state": "7",
"gccsa": "7gdar",
"count": 135
},
{
"country": "au",
"state": "7",
"gccsa": "7rnte",
"count": 47
},
{
"country": "au",
"state": "8",
"gccsa": null,
"count": 6
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"count": 451
},
{
"country": "au",
"state": "9",
"gccsa": "9oter",
"count": 1
}
]
Aggregation Level: suburb
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/place/collections/twitter?aggregationLevel=suburb&startDate=2021-07-26&endDate=2021-07-31'\
--header "Authorization: Bearer ${JWT}"
[
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60322",
"count": 89
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60367",
"count": 5
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60426",
"count": 1
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60427",
"count": 1
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60437",
"count": 3
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60634",
"count": 1
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60675",
"count": 1
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60702",
"count": 2
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60717",
"count": 1
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60749",
"count": 6
},
{
"country": "au",
"state": "6",
"gccsa": "6rtas",
"suburb": "60758",
"count": 3
},
{
"country": "au",
"state": "7",
"gccsa": null,
"suburb": null,
"count": 20
},
{
"country": "au",
"state": "7",
"gccsa": "7gdar",
"suburb": "70073",
"count": 135
},
{
"country": "au",
"state": "7",
"gccsa": "7rnte",
"suburb": "70005",
"count": 21
},
{
"country": "au",
"state": "7",
"gccsa": "7rnte",
"suburb": "70108",
"count": 23
},
{
"country": "au",
"state": "7",
"gccsa": "7rnte",
"suburb": "70133",
"count": 1
},
{
"country": "au",
"state": "7",
"gccsa": "7rnte",
"suburb": "70241",
"count": 1
},
{
"country": "au",
"state": "7",
"gccsa": "7rnte",
"suburb": "70251",
"count": 1
},
{
"country": "au",
"state": "8",
"gccsa": null,
"suburb": null,
"count": 6
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"count": 420
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80012",
"count": 25
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80083",
"count": 1
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80092",
"count": 4
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80113",
"count": 1
},
{
"country": "au",
"state": "9",
"gccsa": "9oter",
"suburb": "90003",
"count": 1
},
{
"country": "br",
"state": null,
"gccsa": null,
"suburb": null,
"count": 4
}
]
Aggregation Level: language
Below is an abridged sample response (NOTE: the actual response was much larger) for a place request with `startDate` as `2021-07-26` and `endDate` as `2021-07-31`, with `aggregationLevel` set to `language`:
[
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "fr",
"count": 2
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "hi",
"count": 1
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "in",
"count": 7
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "ja",
"count": 1
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "no",
"count": 1
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "pl",
"count": 1
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "th",
"count": 3
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "tl",
"count": 2
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "und",
"count": 29
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "ur",
"count": 1
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80003",
"language": "zh",
"count": 1
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80012",
"language": "en",
"count": 25
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80083",
"language": "en",
"count": 1
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80092",
"language": "en",
"count": 4
},
{
"country": "au",
"state": "8",
"gccsa": "8acte",
"suburb": "80113",
"language": "en",
"count": 1
},
{
"country": "au",
"state": "9",
"gccsa": "9oter",
"suburb": "90003",
"language": "en",
"count": 1
}
]
Term Frequency
The count frequency of terms present in a social media collection, within a certain date range, can also be analyzed. The terms are stemmed using the nltk PorterStemmer function, so the words that can be queried are stem words. E.g. “Likes”, “liked”, “likely” and “liking” will all be reduced to “like” after stemming.
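The stemming behaviour can be reproduced locally. A sketch using the nltk package (assumed installed via `pip install nltk`; no corpus download is required for the Porter stemmer):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["Likes", "liked", "likely", "liking"]:
    # stem() lowercases by default, so "Likes" also maps to "like"
    print(word, "->", stemmer.stem(word))
# Likes -> like
# liked -> like
# likely -> like
# liking -> like
```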
All terms
The terms available, along with their aggregated count, in a time period can be queried at the following endpoint:
GET /analysis/terms/collections/{collection}
where ‘collection’ is a path parameter naming the target social media collection. The query parameters for the function are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
startDate | string (date-time) | Start date for data requested | Yes | 2021-09-16 |
endDate | string (date-time) | End date for data requested | Yes | 2021-09-21 |
The results from this endpoint can also be a precursor to the Specific Terms request, to determine which stem words are available to be queried for daily count statistics.
Examples
cURL
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/terms/collections/twitter?startDate=2021-07-26&endDate=2021-08-10'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/terms/collections/twitter'
# query parameters set in the dict below
qs_params = { 'startDate' : '2021-07-26' , 'endDate':'2021-08-10'}
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample response
Below is an abridged sample response (NOTE: the actual response was much larger) for an All terms request with `startDate` as `2021-07-26` and `endDate` as `2021-08-10`:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/terms/collections/twitter?startDate=2021-07-26&endDate=2021-08-10'\
--header "Authorization: Bearer ${JWT}"
{
"terms": {
"number": 29444,
"nurs": 3984,
"nut": 1987,
"octob": 2378,
"odd": 1525,
"off": 2559,
"offer": 6118,
"offic": 10772,
"offici": 3868,
"oil": 2963,
"okay": 5562,
"old": 3689,
"olymp": 25792,
"omg": 9915,
"one": 24539,
"onli": 1945,
"onlin": 8532,
"open": 4987,
"oper": 4458,
"opinion": 9746,
"opportun": 8799,
"opposit": 3761,
"option": 9490,
"orang": 1640,
"order": 13610,
"organ": 1620,
"organis": 3635,
"origin": 2704,
"other": 18474,
"out": 4424,
"outbreak": 6269,
"outcom": 3295,
"outdoor": 1078,
"outfit": 1252,
"over": 879,
"owner": 3989,
"pacif": 589,
"pack": 3184,
"packag": 2538,
"page": 7929,
"pain": 5848,
"paint": 2334,
"pair": 2111,
"pandem": 5462,
"panel": 2095,
"pant": 1879,
"paper": 11548,
"parent": 9057,
"park": 6372,
"parliament": 2676,
"part": 29335,
"parti": 15366,
"particip": 1335,
"partner": 5619,
"pass": 5302,
"passion": 2211,
"passport": 2642,
"past": 2968,
"path": 2811,
"patienc": 1316,
"patient": 5572,
"pay": 8529,
"payment": 5798,
"pcr": 1563,
"peac": 3594,
"peak": 993,
"penalti": 2638,
"pension": 587,
"peopl": 173508,
"perfect": 3930,
"perform": 7532,
"period": 6783,
"permiss": 246,
}
}
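The response is a single `terms` object mapping stems to counts, so finding the most frequent stems is straightforward. A sketch over a few entries from the sample above:

```python
# `result` is the parsed JSON from the All terms request (abridged here).
result = {"terms": {"peopl": 173508, "part": 29335, "number": 29444,
                    "olymp": 25792, "one": 24539}}

# Top 3 stems by frequency.
top = sorted(result["terms"].items(), key=lambda item: item[1], reverse=True)[:3]
print(top)  # [('peopl', 173508), ('number', 29444), ('part', 29335)]
```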
Specific terms
The daily frequencies of specific terms of interest can be queried at the following endpoint:
GET /analysis/terms/collections/{collection}/term
where ‘collection’ is a path parameter naming the target social media collection. The query parameters for the function are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
startDate | string (date-time) | Start date for data requested | Yes | 2021-09-16 |
endDate | string (date-time) | End date for data requested | Yes | 2021-09-21 |
terms | string | List of terms in comma separated format | Yes | covid,scomo,vaccin |
Examples
cURL
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/terms/collections/twitter/term?startDate=2021-07-13&endDate=2021-07-31&terms=scomo,vaccin'\
--header "Authorization: Bearer ${JWT}"; echo
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/terms/collections/twitter/term'
# query parameters set in the dict below
qs_params = { 'startDate' : '2021-07-13' , 'endDate':'2021-07-31', 'terms' : 'scomo,vaccin' }
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample response
Below is an abridged sample response (NOTE: the actual response was much larger) for a Specific terms request with `startDate` as `2021-07-13` and `endDate` as `2021-07-31`, with the `terms` set to `scomo,vaccin`:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/terms/collections/twitter/term?startDate=2021-07-13&endDate=2021-07-31&terms=scomo,vaccin'\
--header "Authorization: Bearer ${JWT}"
{
"scomo": [
{
"date": "2021-7-13",
"count": 132
},
{
"date": "2021-7-14",
"count": 142
},
{
"date": "2021-7-15",
"count": 198
},
{
"date": "2021-7-16",
"count": 148
},
{
"date": "2021-7-17",
"count": 118
},
{
"date": "2021-7-18",
"count": 107
},
{
"date": "2021-7-19",
"count": 138
},
{
"date": "2021-7-20",
"count": 134
},
{
"date": "2021-7-21",
"count": 225
},
{
"date": "2021-7-22",
"count": 229
},
{
"date": "2021-7-23",
"count": 227
},
{
"date": "2021-7-24",
"count": 162
},
{
"date": "2021-7-25",
"count": 164
},
{
"date": "2021-7-26",
"count": 133
},
{
"date": "2021-7-28",
"count": 139
},
{
"date": "2021-7-29",
"count": 144
},
{
"date": "2021-7-30",
"count": 191
}
],
"vaccin": [
{
"date": "2021-7-13",
"count": 3262
},
{
"date": "2021-7-14",
"count": 3175
},
{
"date": "2021-7-15",
"count": 3080
},
{
"date": "2021-7-16",
"count": 3193
},
{
"date": "2021-7-17",
"count": 2858
},
{
"date": "2021-7-18",
"count": 2910
},
{
"date": "2021-7-19",
"count": 4156
},
{
"date": "2021-7-20",
"count": 3978
},
{
"date": "2021-7-21",
"count": 4492
},
{
"date": "2021-7-22",
"count": 5197
},
{
"date": "2021-7-23",
"count": 6336
},
{
"date": "2021-7-24",
"count": 6047
},
{
"date": "2021-7-25",
"count": 5163
},
{
"date": "2021-7-26",
"count": 4124
},
{
"date": "2021-7-27",
"count": 3680
},
{
"date": "2021-7-28",
"count": 4039
},
{
"date": "2021-7-29",
"count": 4578
},
{
"date": "2021-7-30",
"count": 4599
}
]
}
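A response in this shape is easy to post-process client-side. Below is a minimal sketch (pure Python, not part of the API) that finds each term's peak day; the `sample` dict abridges the response above:

```python
def peak_days(result):
    # Map each term to the (date, count) pair with the highest count.
    peaks = {}
    for term, series in result.items():
        # Each series is a list of {"date": ..., "count": ...} objects.
        top = max(series, key=lambda point: point["count"])
        peaks[term] = (top["date"], top["count"])
    return peaks

sample = {
    "scomo": [
        {"date": "2021-7-21", "count": 225},
        {"date": "2021-7-22", "count": 229},
    ]
}
print(peak_days(sample))  # {'scomo': ('2021-7-22', 229)}
```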
Term Similarity
In the active natural language processing pipeline on the ADO cluster, word embedding models are built daily using word2vec, on a corpus derived from a single day's worth of text data. Word embedding models can be used to measure how similar words are to each other in the embedding space, by computing the cosine similarity of their word vectors. Similarities change across different days, depending on the semantic relationships of the terms (often how frequently they appear together).
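The similarity measure itself is straightforward to sketch. Below is a toy cosine-similarity computation in pure Python; the vectors are made-up examples, not actual model output (the deployed models use 100-dimensional vectors, per the metadata endpoint):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors are maximally similar (1.0); orthogonal ones score 0.0.
print(round(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4))  # 1.0
```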
Terms available
To list the terms available for querying in a specific word embedding model, the following endpoint can be queried:
GET /analysis/nlp/collections/{collection}/days/{day}/terms
where collection is a path parameter specifying the target social media collection, and day is the day of the model requested:
Examples
cURL
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/days/2021-07-09/terms'\
--header "Authorization: Bearer ${JWT}"; echo
Python
import requests
day = '2021-07-09'
collection = 'twitter'
url = f'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/{collection}/days/{day}/terms'
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers)
result = res.json()
Query word embedding models
To query a model and retrieve the top 25 most similar terms, along with their cosine similarities, the following endpoint can be queried:
GET /analysis/nlp/collections/{collection}/days/{day}/terms/{term}
where collection is a path parameter specifying the target social media collection, day is a path parameter in the format YYYY-MM-DD, and term is the term being queried.
Examples
cURL
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/days/2021-07-09/terms/vaccin'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
term = 'vaccin'
day = '2021-07-09'
url = f'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/days/{day}/terms/{term}'
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers)
result = res.json()
Sample Response
Below is a sample response for the term vaccin
on the day 2021-07-09
:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/days/2021-07-09/terms/vaccin'\
--header "Authorization: Bearer ${JWT}"
{
"agenda": 0.5740534067153931,
"approv": 0.6987619400024414,
"australian": 0.6622416973114014,
"blame": 0.6074656844139099,
"blood": 0.6090492010116577,
"booster": 0.8681329488754272,
"bullshit": 0.6017955541610718,
"cabinet": 0.5635865330696106,
"campaign": 0.6179795265197754,
"capac": 0.5706251859664917,
"claim": 0.5647196769714355,
"clinic": 0.7478079199790955,
"concern": 0.5908010005950928,
"condit": 0.5645681023597717,
"conspiraci": 0.6220813393592834,
"coronaviru": 0.6344245672225952,
"coverag": 0.6052238941192627,
"covid": 0.7316592931747437,
"crisi": 0.6347731351852417,
"damag": 0.5770517587661743,
"danger": 0.6578422784805298,
"deal": 0.6643673181533813,
"death": 0.57118159532547,
"delta": 0.5919604301452637,
"disabl": 0.5580296516418457,
"disast": 0.608578085899353,
"diseas": 0.7518317699432373,
"doctor": 0.7485677003860474,
"dose": 0.7945512533187866,
"effect": 0.6603673696517944,
"emerg": 0.6338611841201782,
"evid": 0.5946788787841797,
"facil": 0.6528164148330688,
"fact": 0.5785788297653198,
"fail": 0.5803477168083191,
"failur": 0.7815086245536804,
"fed": 0.6356635093688965,
"feder": 0.5594338178634644,
"figur": 0.7420264482498169,
"flu": 0.7349779605865479,
"gov": 0.6872922778129578,
"govern": 0.5914908647537231,
"govt": 0.6604992151260376,
"headlin": 0.6021615266799927,
"hospitalis": 0.6953837275505066,
"hub": 0.723802387714386,
"hunt": 0.5952842235565186,
"icu": 0.6137961149215698,
"ill": 0.620633602142334,
"immun": 0.8382552266120911,
"incompet": 0.5905853509902954,
"increas": 0.592781662940979,
"infect": 0.6805194616317749,
"jab": 0.8495683670043945,
"lack": 0.5874945521354675,
"lie": 0.5710851550102234,
"major": 0.5966668128967285,
"mass": 0.8208097219467163,
"medicin": 0.6375464200973511,
"million": 0.7198325991630554,
"msm": 0.6013255715370178,
"outbreak": 0.5976635813713074,
"pandem": 0.6687020659446716,
"patient": 0.5810028314590454,
"pfizer": 0.900665283203125,
"phase": 0.6531299352645874,
"popul": 0.7427656054496765,
"proof": 0.5772579312324524,
"propaganda": 0.5715232491493225,
"protect": 0.7308518886566162,
"quarantin": 0.6936514973640442,
"rate": 0.6851494312286377,
"report": 0.6093935966491699,
"respons": 0.631423830986023,
"risk": 0.68557208776474,
"roll": 0.7304665446281433,
"rollout": 0.912151038646698,
"scientist": 0.5725030303001404,
"scomo": 0.6933631896972656,
"seem": 0.6128174066543579,
"shot": 0.6573461294174194,
"spin": 0.5988937020301819,
"spread": 0.5883970856666565,
"stat": 0.5837931632995605,
"statement": 0.654644250869751,
"strain": 0.6624536514282227,
"suppli": 0.8358686566352844,
"symptom": 0.7199208736419678,
"target": 0.5763600468635559,
"theori": 0.5651994347572327,
"thousand": 0.572020411491394,
"total": 0.5963312387466431,
"treatment": 0.6949573755264282,
"trial": 0.5938171744346619,
"variant": 0.6742241978645325,
"vax": 0.9031803011894226,
"ventil": 0.6273638010025024,
"viru": 0.6889223456382751,
"which": 0.6370202302932739,
"zero": 0.6220378279685974
}
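Since the response is a flat term-to-similarity mapping, the most similar terms can be ranked client-side. A minimal sketch (pure Python; the `sample` dict abridges the response above):

```python
def top_similar(result, n=5):
    # Sort the term -> cosine-similarity mapping, highest first,
    # and keep the n best matches.
    return sorted(result.items(), key=lambda kv: kv[1], reverse=True)[:n]

sample = {"pfizer": 0.9007, "rollout": 0.9122, "vax": 0.9032, "agenda": 0.5741}
print(top_similar(sample, n=2))  # [('rollout', 0.9122), ('vax', 0.9032)]
```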
Topic Modelling
Topic modelling is performed daily on a corpus derived from the previous day’s social media posts. As a result, a list of clusters is obtained for every day, which represents the topics of that particular day. In the metadata of each topic cluster there is the following information:
• the size of the cluster
• the list of document ids in the cluster
• the top 30 terms (by frequency) of the cluster
• the aggregate pairwise similarity of the 30 terms, computed using a word embedding model (word2vec) built with the same corpus
A variety of API endpoints allow access to this data.
Topics
To get the list of topics on certain days (in a date range), with the top 30 terms representing each topic, the following endpoint can be queried.
GET /analysis/nlp/collections/{collection}/topics
where collection is a path parameter specifying the target social media collection. The query parameters for the endpoint are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
startDate | string (date-time) | Start date for data requested | Yes | 2021-09-16 |
endDate | string (date-time) | End date for data requested | Yes | 2021-09-21 |
fullResult | boolean | when set to true, it returns the whole clustering information, otherwise returns only the top terms and size of each cluster | No | false |
Examples
cURL
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/topics?startDate=2021-07-15&endDate=2021-07-31'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/topics'
# query parameters set in the dict below
headers = {'Authorization': f"Bearer {jwt}"}
qs_params = { 'startDate' : '2021-07-15' , 'endDate':'2021-07-31' }
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample responses
Below is an abridged sample response (NOTE: the actual response was much larger) for a topics request with startDate as 2021-07-06 and endDate as 2021-07-07:
[
{
"time": "twitter-2021-7-6",
"topics": [
{
"size": 13426,
"terms": [
[
"time",
0.012539179088579358
],
[
"thing",
0.011031433589385515
],
[
"thank",
0.010893080954586857
],
[
"day",
0.01079276048562502
],
[
"way",
0.009899755014715397
],
[
"peopl",
0.009862197831996721
],
[
"year",
0.009700065374291758
],
[
"work",
0.009599968190570853
],
[
"pleas",
0.009439287725906211
],
[
"love",
0.00923899830888478
],
[
"someth",
0.008347818972842945
],
[
"famili",
0.007857277204932738
],
[
"man",
0.007826952365885256
],
[
"life",
0.007707197935138655
],
[
"look",
0.00760704576238854
],
[
"number",
0.00748684770119196
],
[
"hope",
0.007459941187635528
],
[
"chang",
0.0073872271226581845
],
[
"lol",
0.007273909313434026
],
[
"world",
0.007218640153030318
],
[
"today",
0.0070560603622170735
],
[
"lot",
0.007020157362037514
],
[
"stori",
0.0067819740134623194
],
[
"point",
0.006586016841722677
],
[
"someon",
0.006336134860858228
],
[
"amp",
0.006335191555513297
],
[
"person",
0.006326825382022919
],
[
"mate",
0.0063065568290590086
],
[
"idea",
0.00630524951478842
],
[
"issu",
0.006262870208386203
]
]
},
{
"size": 1184,
"terms": [
[
"been",
0.6629800970648556
],
[
"and",
0.6447700257307555
],
[
"tood",
0.5790615196460484
],
[
"wbk",
0.5790615196460484
],
[
"ofoot",
0.5790615196460484
],
[
"lme",
0.5790615196460484
],
[
"lulz",
0.5058207004015077
],
[
"aint",
0.4717656588171083
],
[
"edit",
0.3298835450938238
],
[
"note",
0.30500326203449424
],
[
"new",
0.24546509924968438
],
[
"good",
0.23786193215159798
],
[
"other",
0.2128574234233451
]
]
},
{
"size": 854,
"terms": [
[
"women",
0.02896990327024129
],
[
"peopl",
0.01671489963171787
],
[
"labor",
0.015499202570304927
],
[
"lnp",
0.014917234756772927
],
[
"govern",
0.014581957505262429
],
[
"men",
0.01409398595674428
],
[
"parti",
0.013829154562138842
],
[
"vote",
0.012737353913619187
],
[
"woman",
0.012643179591426178
],
[
"thing",
0.012518401784242076
],
[
"elect",
0.012349628904950012
],
[
"amp",
0.01229898149785756
],
[
"countri",
0.011066587380819054
],
[
"noth",
0.01101237688074447
],
[
"time",
0.010930784687508228
],
[
"someon",
0.010913677044621818
],
[
"way",
0.010883379274568714
],
[
"person",
0.01087435730958338
],
[
"book",
0.010083183625184377
],
[
"point",
0.009814435710351012
],
[
"sex",
0.009678021925846197
],
[
"media",
0.009644183199768645
],
[
"polici",
0.00942604534487092
],
[
"word",
0.009384505091836968
],
[
"job",
0.00929449252100335
],
[
"world",
0.008890181955987965
],
[
"right",
0.008671508417990654
],
[
"leader",
0.008594574935355337
],
[
"power",
0.00858276775345682
],
[
"minist",
0.008506955957547122
]
]
}
]
},
{
"time": "twitter-2021-7-7",
"topics": [
{
"size": 13625,
"terms": [
[
"peopl",
0.04954222129345652
],
[
"time",
0.03852017067849832
],
[
"thank",
0.036208638709126315
],
[
"year",
0.03078951789567827
],
[
"day",
0.029598754995594615
],
[
"thing",
0.028388628668967766
],
[
"way",
0.025430383221641064
],
[
"amp",
0.023461890502198015
],
[
"vaccin",
0.022571555371053823
],
[
"game",
0.02133101156090405
],
[
"work",
0.020237805815050747
],
[
"someth",
0.016794161708378788
],
[
"man",
0.015534358266290738
],
[
"world",
0.015384118574431803
],
[
"someon",
0.015120125645947224
],
[
"school",
0.015006561551544784
],
[
"look",
0.014854740985397878
],
[
"point",
0.01470245590150101
],
[
"love",
0.0145879336603674
],
[
"life",
0.0145879336603674
],
[
"week",
0.014473144192518831
],
[
"everyon",
0.014049920157173034
],
[
"govern",
0.013700852268937852
],
[
"lol",
0.013661907987933455
],
[
"noth",
0.013622931506343448
],
[
"lot",
0.013309946606863058
],
[
"anyth",
0.013270675101068469
],
[
"women",
0.013231370219479292
],
[
"hope",
0.013152659838993398
],
[
"home",
0.013034340863974179
]
]
},
{
"size": 851,
"terms": [
[
"just",
1.0704923609828432
],
[
"second",
1.0102780475251398
],
[
"forgot",
0.6540451718630429
],
[
"twice",
0.6199901302786435
],
[
"dam",
0.5975586478372293
],
[
"shouldn",
0.5808043526185023
],
[
"more",
0.5674263062543589
],
[
"mon",
0.5674263062543589
],
[
"been",
0.5154157357510538
],
[
"ill",
0.4865581302646353
],
[
"not",
0.43945345020516285
],
[
"other",
0.3852927434460453
]
]
}
]
}
]
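A response in this shape can be flattened client-side. Below is a minimal sketch (pure Python, not part of the API) reducing each cluster to its day, size, and top terms; the `sample` list abridges the response above:

```python
def summarise_topics(response, n_terms=3):
    # Reduce a topics response to (day, cluster size, top-n terms) tuples.
    rows = []
    for day in response:
        for topic in day["topics"]:
            # Each entry in "terms" is a [term, weight] pair.
            terms = [term for term, _weight in topic["terms"][:n_terms]]
            rows.append((day["time"], topic["size"], terms))
    return rows

sample = [
    {"time": "twitter-2021-7-6",
     "topics": [{"size": 13426,
                 "terms": [["time", 0.0125], ["thing", 0.0110], ["thank", 0.0109]]}]}
]
print(summarise_topics(sample))
# [('twitter-2021-7-6', 13426, ['time', 'thing', 'thank'])]
```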
Below is an abridged sample response (NOTE: the actual response was much larger) for a topics request with startDate as 2021-07-06 and endDate as 2021-07-07, with fullResult set to true:
Topic Posts
The social media posts/documents assigned to a particular topic cluster can also be retrieved at the following endpoint:
POST /analysis/nlp/collections/{collection}/topicposts
where collection is a path parameter specifying the target social media collection. There are no query parameters for this endpoint. However, the request body has to be an array of topic ids (as strings) in the format ‘yyyymmdd-t’ (where t is the topic number for the day specified by yyyymmdd), such as:
[
'20210922-1', '20210923-4', '20210924-5'
]
and 'Content-Type: application/json' has to be passed in the request headers.
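The ‘yyyymmdd-t’ convention can be made explicit with a small helper (illustrative only, not part of the API):

```python
from datetime import date

def topic_id(day, topic_number):
    # 'yyyymmdd-t': the day of the clustering run, then the topic number.
    return f"{day.strftime('%Y%m%d')}-{topic_number}"

# Assemble a request body for the topicposts endpoint.
body = [topic_id(date(2021, 9, 22), 1), topic_id(date(2021, 9, 23), 4)]
print(body)  # ['20210922-1', '20210923-4']
```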
Examples
cURL
curl -XPOST 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/topicposts'\
--data '["20211010-1", "20211008-2", "20211008-1"]'\
--header 'Content-Type:application/json'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/topicposts'
# topic ids passed as the JSON request body
headers = {'Authorization': f"Bearer {jwt}", 'Content-Type': 'application/json'}
data = ["20211010-1", "20211008-2", "20211008-1"]
res = requests.post(url, headers = headers, json=data)
result = res.json()
Sample responses
Below is an abridged response from a topicposts request for the clusters with ids "20211010-1", "20211008-2", "20211008-1" (clusters on separate days):
curl -XPOST 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/topicposts'\
--data '["20211010-1", "20211008-2", "20211008-1"]'\
--header 'Content-Type:application/json'\
--header "Authorization: Bearer ${JWT}"
[
"1204822986919952384",
"1265315022663753728",
"1293965389089722369",
"1316470333747101701",
"1345038179070476289",
"1350585250506805248",
"1354391784479019008",
"1363294990395301890",
"1374312193785614337",
"1374420691743739909",
"1404490430214029315",
"1406973122193002504",
"1408094835203002369",
"1411106951111622664",
"1413489152708993027",
"1414266348306370568",
"1414309380229672973",
"1418268245493157892",
"1418588497414340610",
"1418614308049723396",
"1423995484469927941",
"1425970355793895427",
"1426007838242050054",
"1429482874500177928",
"1431703502078812168",
"1431741358411173889",
"1432438057274327056",
"1432668449780731906",
"1433138299435167747",
"1433658094538698767",
"1435668001009946624",
"1436039149790892034",
"1438050872068562945",
"1438193702686527490",
"1438680701696679939",
"1438793907207163908",
"1439684661551280128",
"1440300321624911880",
"1440346508780470281"
]
Topic Groupings
A network graph was chosen as the data structure to capture the relationships between topic clusters on consecutive days. In brief, the makeup of the network graph is the following:
• A node is a topic cluster on a given day. Nodes are qualified via the size and similarity of their topic clusters, and are only included in the graph if these are above a defined threshold.
• An edge is an intersection between two nodes on consecutive days. An edge only exists if the number of intersecting terms is above the threshold specified to the function via the API. The higher the threshold, the sparser the network graph.
The network graph can be retrieved at the following endpoint:
GET /analysis/nlp/collections/{collection}/topicgroupings
where collection is a path parameter specifying the target social media collection. The query parameters for the endpoint are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
startDate | string (date-time) | Start date for data requested | Yes | 2021-09-16 |
endDate | string (date-time) | End date for data requested | Yes | 2021-09-21 |
threshold | integer | minimum number of common terms to define a grouping between two clusters | Yes | 15 |
Examples
cURL
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/topicgroupings?startDate=2021-07-10&endDate=2021-07-20&threshold=12'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/topicgroupings'
# query parameters set in the dict below
qs_params = { 'startDate' : '2021-07-10' , 'endDate':'2021-07-20', 'threshold' : 12 }
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample responses
Below is a sample response for a topicgroupings request, with startDate as 2021-07-10, endDate as 2021-07-20, and threshold as 12:
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/topicgroupings?startDate=2021-07-10&endDate=2021-07-20&threshold=12'\
--header "Authorization: Bearer ${JWT}"
{
"directed": true,
"graph": {},
"links" : [
{
"common_terms": [
"band",
"concert",
"track",
"artist",
"audio",
"sound",
"tune",
"singer",
"lyric",
"album",
"music",
"rock",
"listen",
"feat",
"song",
"metal",
"guitar",
"soundtrack",
"vocal",
"remix",
"danc",
"courtesi",
"ti",
"sing",
"playlist"
],
"sequences": [
1
],
"source": "20210710-1",
"target": "20210711-1",
"weight": 25
}
],
"multigraph": false,
"nodes" : [
{
"day": 0,
"id": "20210710-1",
"sequences": [
1
],
"size": 2257,
"terms": [
"song",
"music",
"album",
"sound",
"audio",
"band",
"playlist",
"tune",
"rock",
"lyric",
"theme",
"guitar",
"listen",
"remix",
"soundtrack",
"feat",
"track",
"sing",
"artist",
"metal",
"concert",
"vocal",
"singer",
"courtesi",
"punk",
"version",
"danc",
"piano",
"ti",
"give"
]
},
{
"day": 1,
"id": "20210711-1",
"sequences": [
1
],
"size": 1940,
"terms": [
"song",
"music",
"album",
"playlist",
"sound",
"audio",
"tune",
"band",
"rock",
"radio",
"soundtrack",
"listen",
"guitar",
"jam",
"lyric",
"singer",
"track",
"remix",
"feat",
"ti",
"concert",
"drum",
"artist",
"microphon",
"metal",
"vocal",
"danc",
"sing",
"app",
"courtesi"
]
}
]
}
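The response is a node-link representation of the graph (a directed flag, a nodes array, and a links array with source/target ids), so it can be walked directly. Below is a minimal pure-Python sketch (not part of the API) that follows a cluster forward to the cluster(s) it feeds into on the next day; the `sample` dict abridges the response above:

```python
def successors(graph, node_id):
    # Follow the directed links from one topic cluster to the
    # cluster(s) it groups with on the following day.
    return [link["target"] for link in graph["links"] if link["source"] == node_id]

sample = {
    "directed": True,
    "links": [{"source": "20210710-1", "target": "20210711-1", "weight": 25}],
    "nodes": [{"id": "20210710-1"}, {"id": "20210711-1"}],
}
print(successors(sample, "20210710-1"))  # ['20210711-1']
```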
Topic modelling metadata
To retrieve the model parameters, corpus, and processing time of the topic modelling performed on the cluster, the following endpoint can be queried.
GET /analysis/nlp/collections/{collection}/metadata
where collection is a path parameter specifying the target social media collection. The query parameters for the endpoint are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
startDate | string (date-time) | Start date for data requested | Yes | 2021-09-16 |
endDate | string (date-time) | End date for data requested | Yes | 2021-09-21 |
Examples
cURL
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/metadata?startDate=2021-07-13&endDate=2021-07-31'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/metadata'
# query parameters set in the dict below
qs_params = { 'startDate' : '2021-07-10' , 'endDate':'2021-07-20'}
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample response
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/nlp/collections/twitter/metadata?startDate=2021-07-13&endDate=2021-07-15'\
--header "Authorization: Bearer ${JWT}"
[
{
"id": "twitter-20210713",
"corpus": {
"start": "2021-07-13T00:00:00+11:00",
"end": "2021-07-13T23:59:59+11:00",
"days": 1,
"ndocuments": 237902
},
"processing": {
"start": "2021-11-25T17:32:53+11:00",
"end": "2021-11-25T17:57:44+11:00",
"procminutes": 24,
"corpussize": 142944,
"dictionarysize": 42127,
"parameters": {
"w2v": {
"wv_vector_size": 100,
"wv_min_count": 100,
"wv_window_size": 5,
"wv_training_algorithm": 0
},
"clustering": {
"bert_model": "w2v",
"bert_top_n_words": 30,
"bert_min_topic_size": 100
}
}
}
},
{
"id": "twitter-20210714",
"corpus": {
"start": "2021-07-14T00:00:00+11:00",
"end": "2021-07-14T23:59:59+11:00",
"days": 1,
"ndocuments": 252765
},
"processing": {
"start": "2021-11-25T17:32:52+11:00",
"end": "2021-11-25T18:02:45+11:00",
"procminutes": 29,
"corpussize": 150183,
"dictionarysize": 43002,
"parameters": {
"w2v": {
"wv_vector_size": 100,
"wv_min_count": 100,
"wv_window_size": 5,
"wv_training_algorithm": 0
},
"clustering": {
"bert_model": "w2v",
"bert_top_n_words": 30,
"bert_min_topic_size": 100
}
}
}
},
{
"id": "twitter-20210715",
"corpus": {
"start": "2021-07-15T00:00:00+11:00",
"end": "2021-07-15T23:59:59+11:00",
"days": 1,
"ndocuments": 227384
},
"processing": {
"start": "2021-11-25T17:32:54+11:00",
"end": "2021-11-25T18:02:02+11:00",
"procminutes": 29,
"corpussize": 135061,
"dictionarysize": 39341,
"parameters": {
"w2v": {
"wv_vector_size": 100,
"wv_min_count": 100,
"wv_window_size": 5,
"wv_training_algorithm": 0
},
"clustering": {
"bert_model": "w2v",
"bert_top_n_words": 30,
"bert_min_topic_size": 100
}
}
}
}
]
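The per-run metadata lends itself to quick summaries. A minimal sketch (pure Python, not part of the API; the `sample` list abridges the response above):

```python
def summarise_runs(result):
    # Total corpus documents and average processing time across all runs.
    total_docs = sum(run["corpus"]["ndocuments"] for run in result)
    avg_minutes = sum(run["processing"]["procminutes"] for run in result) / len(result)
    return total_docs, avg_minutes

sample = [
    {"corpus": {"ndocuments": 237902}, "processing": {"procminutes": 24}},
    {"corpus": {"ndocuments": 252765}, "processing": {"procminutes": 29}},
]
print(summarise_runs(sample))  # (490667, 26.5)
```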
Text Search
The text search query follows the Lucene syntax. The fields available are:
- text (tokenized text of the social media post)
- hashtags (text of hashtags)
- author (author’s id)
- date (date of posting expressed as YYYYMMDD, in UTC)
- language (the language the post is expressed in)
For instance, a valid expression could be: hashtags:booksthatmademe AND author:"22250517" AND text:helped AND language:"en" AND date:"20210718"
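A compound query like the example above can be assembled programmatically. Below is a small sketch; the helper is hypothetical (not part of the API) and assumes the field values need no Lucene escaping:

```python
def lucene_and(**fields):
    # Join field:value pairs with AND, quoting each value for an exact match.
    return " AND ".join(f'{field}:"{value}"' for field, value in fields.items())

query = lucene_and(hashtags="booksthatmademe", author="22250517", language="en")
print(query)  # hashtags:"booksthatmademe" AND author:"22250517" AND language:"en"
```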
GET /analysis/textsearch/collections/{collection}
where ‘collection’ is a path parameter of the target social media collection. The query parameters for the function are:
Parameter | type | Description | Required | Example |
---|---|---|---|---|
query | string | Full-text query | Yes | climate* |
The output is returned in pages of 200 IDs each. Pagination is managed via the x-ado-bookmark header: it is returned as a response header by each request, and has to be added as a request header to the next request to retrieve the subsequent page, until an empty ID array is returned.
The total number of rows returned by the query is contained in the x-ado-totalrows response header.
Examples
cURL (command-line)
curl -XGET 'https://api.ado.eresearch.unimelb.edu.au/analysis/textsearch/collections/twitter?query=hashtags:climate*'\
--header "Authorization: Bearer ${JWT}"
Python
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/textsearch/collections/twitter'
# query parameters set in the dict below
qs_params = { 'query' : 'hashtags:climate*' }
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result = res.json()
Sample Response
[
"1440864320926011394",
"1505702676994015232"
]
Pagination (Python)
import requests
url = 'https://api.ado.eresearch.unimelb.edu.au/analysis/textsearch/collections/twitter'
qs_params = { 'query' : 'hashtags:climate*' }
headers = {'Authorization': f"Bearer {jwt}"}
res = requests.get(url, headers = headers, params=qs_params)
result1 = res.json()
# Second page: pass back the bookmark returned by the first request
headers = {'Authorization': f"Bearer {jwt}", 'x-ado-bookmark': res.headers['x-ado-bookmark']}
res = requests.get(url, headers = headers, params=qs_params)
result2 = res.json()
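The two manual requests above generalise to a loop that keeps feeding the returned bookmark back until an empty page arrives, per the termination rule described earlier. A sketch (not part of the API; the `get` argument is injected so the paging logic can be exercised without a live connection, pass requests.get in real use):

```python
def fetch_all_ids(get, url, query, jwt):
    # Page through text-search results 200 IDs at a time, feeding each
    # response's x-ado-bookmark header into the next request; stop when
    # an empty ID array comes back.
    headers = {'Authorization': f"Bearer {jwt}"}
    ids = []
    while True:
        res = get(url, headers=headers, params={'query': query})
        page = res.json()
        if not page:
            return ids
        ids.extend(page)
        headers['x-ado-bookmark'] = res.headers['x-ado-bookmark']

# Real use: all_ids = fetch_all_ids(requests.get, url, 'hashtags:climate*', jwt)
```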